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Abstract Recent decades have seen enormous improve¬ 
ments in computational inference for statistical models; 
there have been competitive continual enhancements in 
a wide range of computational tools. In Bayesian infer¬ 
ence, first and foremost, MCMC techniques have con¬ 
tinued to evolve, moving from random walk proposals 
to Langevin drift, to Hamiltonian Monte Carlo, and so 
on, with both theoretical and algorithmic innovations 
opening new opportunities to practitioners. However, 
this impressive evolution in capacity is confronted by an 
even steeper increase in the complexity of the datasets 
to be addressed. The difficulties of modelling and then 
handling ever more complex datasets most likely call 
for a new type of tool for computational inference that 
dramatically reduces the dimension and size of the raw 
data while capturing its essential aspects. Approximate 
models and algorithms may thus be at the core of the 
next computational revolution. 
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1 Introduction 


One may reasonably balk at the terms “computational 
statistics” and “Bayesian computation” since, from its 
very start, statistics has always involved some computa¬ 
tional step to extract information, something manage¬ 
able like an estimator or a prediction, from raw data. 
This necessarily incomplete and unavoidably biased re¬ 
view of the recent past, current state, and immediate 
future of algorithms for Bayesian inference thus first re¬ 
quires us to explain what we mean by computation in a 
statistical context, before turning to what we perceive 
as medium term solutions and possible deadends. 

Computations are an issue in statistics whenever 
processing a dataset becomes a difficulty, a liability, 
or even an impossibility. Obviously, the computational 
challenge varies according to the time when it is faced: 
what was an issue in the 19th century is most likely not 
so any longer (take for instance the derivation of the 
moment estimates of a mixture of two normal distribu¬ 


tions so painstakenly set by Pearson (1894) for estimat 


ing the ratio of “forehead” breadth to body length on 
a dataset of 1,000 crabs or the intense algebraic deriva¬ 
tions found in the analysis of variance of the 1950s and 
1960s dSearle et aI.|[T^| )). 

The introduction of simulation tools in the 1940s 
followed hard on the heels of the invention of the com¬ 
puter and certainly contributed an impetus towards 
faster and better computers, at least in the first decade 
of this revolution. This shows that these tools were 
both needed, and unavailable without electronic calcu¬ 
lators. The introduction of Markov chain Monte Carlo 
is harder to pin down as some partial versions can be 
traced all the way back to 1944-45 and the Manhat¬ 
tan project at Los Alamos (Metro polis||198 7). It is sur¬ 
prisingly much later, i.e., only by the early 1990s, that 
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such methods became part of the Bayesian toolbox, 
that is, some time after the devising of other computer- 
dependent tools like the bootstrap or the EM algo¬ 
rithm, and despite the availability of personal comput¬ 
ers that considerably eased programming and exper¬ 


imenting (Robert and Casella 2011). It is presumably 


pointless to try to attribute this delay to a definite cause 
but a certain lack of probabilistic culture within the 
statistics community is probably partly to blame. 

What makes this time-lag in MCMC methods be¬ 
coming assimilated into the statistics community even 
more surprising is that fact that Bayesian inference 
having a significant role in statistical practice was re¬ 
ally on hold pending the discovery of flexible computa¬ 
tional tools that (implictly or explicitly) delivered val¬ 
ues for the medium- to high-dimensional integrals that 
underpin the calculation of posterior distributions, in 
all but toy problems where conjugacy provided explicit 
answers. In fact, until Bayesians discovered MCMC, 
the only computational methodology that seemed to 
offer much chance of making practical Bayesian statis¬ 
tics practical was the portfolio of quadrature methods 
developed under Adrian Smith’s leadership at Notting¬ 
ham ( Naylor and Smith| 1982 Smith et al.|1985 1987). 

The very first article in the first issue of Statis¬ 
tics and Computing , whose quarter-century we cele¬ 
brate in this special issue, was (to the journal’s credit!) 
on Bayesian analysis, and was precisely in this direc¬ 
tion of using clever quadrature methods to approach 
moderately high-dimensional posterior analysis (Della- 


portas and Wright 1991). By the next (second) issue, 


sampling-based methods had started to appear, with 
three papers out of five in the issue on or related to 
Gibbs sampling (Verdinelli and Wasserman] 1991 Car- 


lin and Gelfand||1991 Wakefield et al.||1991). 

Now, reflecting upon the evolution of MCMC meth¬ 
ods over the 25 or so years they have been at the fore¬ 
front of Bayesian inference, the focus has evolved a long 
way, from hierarchical models that extended the lin¬ 
ear, mixed and generalised linear models ( Albert|[l988 


Carlin et al.||1992 Bennett et al.|[l996 l which were ini¬ 


tially the focus, and graphical models that stemmed 


from image analysis (Geman and Genian 1984) and 


artificial intelligence, to dynamical models driven by 


ODE’s (Wilkinson 2011b) and diffusions (|Roberts and 

Stramer 2001 Dellaportas et al. 2004 

Beskos et al. 

2006), hidden trees (Larget and Simon 

1999 Huelsen- 

beck and Ronquist||200l| Chipman et a 

. 2008f Aldous 

et al. 2008) and graphs, aside with d 

ecision making 
lile research on 
active and con- 

in highly complex graphical models. W. 
MCMC theory and methodology is still 

tinually branching (Papaspiliopoulos et al. 2007; An- 

drieu and Roberts|2009t Latuszyriski et al.||2011[ Douc 


and Robert 201 If) , progressively incorporating the ca¬ 


pacities of parallel processors and GPUs (Lee et al. 


2009 Jacob et al.|2011| Strid|2010 Suchard et al.|2010 


Scott et al. 2013[ Calderhead 2014), we wonder if we 
are not currently facing a new era where those meth¬ 
ods are no longer appropriate to undertake the anal¬ 
ysis of new models, and of new formulations where 
models are no longer completely defined. We indeed 
believe that imprecise models, incomplete information 
and summarised data will become, if not already, a cen¬ 
tral aspect of statistical analysis, due to the massive 
influx of data and the need to provide non-statisticians 
with efficient tools. This is why we also cover in this 
survey the notions of approximate Bayesian computa¬ 
tion (ABC) and comment on the use of optimisation 
tools. 


The plan of the paper is that in Sections 2 and 3 we 
discuss recent progress and current issues in Markov 
chain Monte Carlo and Approximate Bayesian Com¬ 
putation respectively. In Section 4, we highlight some 
araes of modern optimisation that, through lack of fa¬ 
miliarity, are making less impact in the mainstream of 
Bayesian computation than we think justified. Our Dis¬ 
cussion in Section 5 raises issues about data science and 
relevance to applications, and looks to the future. 


2 MCMC, targetting the posterior 

When MCMC techniques were introduced to the main¬ 
stream statistical (Bayesian) community in 1990, they 
were received with skepticism that they could one day 
become the central tool of Bayesian inference. For in¬ 
stance, despite the assurance provided by the ergodic 
theorem, many researchers thought at first that the 
convergence of those algorithms was a mere theoreti¬ 
cal anticipation rather than a practical reality, in con¬ 
trast to traditional Monte Carlo methods, and hence 
that they could not be trusted to provide “exact” an¬ 
swers. This perspective is obviously obsolete by now, 
when MCMC output is considered as “exact” as regular 
Monte Carlo, if possibly less efficient in some settings. 
Nowadays, MCMC is again attracting more attention 
(than in the past decade, say, where developments were 
more about alternatives, some of which described in 
the following sections), both because of methodologi¬ 
cal developments linked to better theoretical tools, for 
instance in the handling of stochastic processes, and 
because of new advances in accelerated computing via 
parallel and cloud computing. 
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2.1 Basics of MCMC 

The introduction of Markov chain based methods within 
Monte Carlo thus took a certain amount of argument 
to reach the mainstream statistical community, when 
compared with other groups who were using MCMC 
methods 10 to 30 years earlier. It may sound unlikely 
at the current stage of our knowledge, but using meth¬ 
ods that (a) generated correlated output, (b) required 
some burnin time to remove the impact of the initial 
distribution and (c) did not lead to a closed form ex¬ 
pression for asymptotic variances were indeed met with 
resistance at first. As often, the immense computing 
advantages offered by this versatile tool soon overcame 
the reluctance to accept those methods as similarly “ex¬ 
act” as other Monte Carlo techniques, applications driv¬ 
ing the move from the early 1990s. We reproduce be¬ 
low the generic version of the “all variables at once” 
Metropolis-Hastings algorithm (Metropolis et al.|1953 


Hastings||1970[ |Besag et al.||1995] ~ Robert and Casella 


2011) as it (still) constitutes in our opinion a fundamen¬ 
tal advance in computational statistics, namely that, 
given a computable density 7r (up to a normalising con¬ 
stant) on 0, and a proposal Markov kernel g(-|-), there 
exists a universal machine that returns a Markov chain 
with the proper stationary distribution, hence an asso¬ 
ciated operational MCMC algorithm. 


variables are updated singly or in blocks according to 
some schedule, remains a keystone in standard use of 
MCMC methodology, even though the newer Hamil¬ 


tonian Monte Carlo approach (see Section 2.31 may 
sooner or later come to replace it. While there is noth¬ 
ing intrinsically unique to the nature of this algorithm, 
or optimal in its convergence properties (other than the 


result of Peskun (1973) on the optimality of the accep¬ 


tance ratio), attempts to bypass Metropolis-Hastings 
are few and limited. For instance, the birtlr-and-death 


process developed by Stephens (2000) used a continu¬ 


ous time jump process to explore a set of models, only 
to be later shown ( Cappe et al.||2002 ) to be equivalent 
to the (Metropolis-Hastings) reversible jump approach 
of Green (199%]). 

Another aspect of the generic Metropolis-Hastings 
that became central more recently is that while the 
accept-reject step does overcome need to know the nor¬ 
malising constant, it still requires 7r, if unnornralised, 
and this may be too expensive to compute or even in¬ 
tractable for complicated models and large datasets. 
Much recent research effort has been devoted to the 
design and understanding of appropriate modifications 
that use estimators or approximations of 7r instead and 
we will take the opportunity to summarise some of the 
progress in this direction. 


Algorithm 1 Metropolis-Hastings algorithm (generic 
version) 

Choose a starting value 0)°)) 
for n = 1 to N do 

Generate 0* from a proposal q(- |00“ D) 

Compute the acceptance probability 

p (ra) = 1 A7r(6''‘)5(e ( "- 1) |ei*))/7r(6l ( "- 1) g(6l*|6i ( "- 1) ) 


Generate u n ~ W(0, 1) and take 00) = 9* if u n < p0), 
00 ) _ $ 0 - 1 ) otherwise. 

end for 


The first observation about the Metropolis Hastings 
is that the flexibility in choosing q is a blessing, but also 
a curse since the choice determines the performance of 
the algorithm. Hence a large part of the research on 
MCMC along the past 30 years (if we arbitrarily set 
the starting date at Geman and Geman (1984)) has 
been on choice of the proposal q to improve the effi¬ 
ciency of the algorithm, and in characterising its con¬ 
vergence properties. This typically requires gathering 
or computing additional information about 7r and we 
discuss some of the fundamental strategies in subse¬ 
quent sections. Algorithm [T] and its variants in which 


2.2 MALA and Manifold MALA 


Stochastic differential equations (SDEs) have been and 
still are informing Monte Carlo development in a num¬ 
ber of seminal ways. A key insight is that the Langevin 
diffusion on 0 solving 

d0 t = log Tr(d t )dt + dB t (1) 


has 7r as its stationary and limiting distribution. Here 
B t is the standard Brownian motion and V denotes 
gradient. The crude approach of sampling an Euler dis¬ 
cretisation (Kloeden and Platen] (1992)) of |l]) and us¬ 
ing it as an approximate sample from 7r was introduced 


in the applied literature (Ermak (1975); Doll and Dion 


(19761). The method results in a Markov chain evolving 
according to the dynamics 


:= flC"- 1 ) + — V log 7r(0( n-1 )) 


( 2 ) 


+ h 1/2 N(0,I t 


dxd) 


for a chosen discretisation step h. There is a delicate 
tradeoff between accuracy of the approximation im¬ 
proving as h —>■ 0 and sampling efficiency (as mea¬ 
sured e.g. by the effective sample size) improving when 
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h increases. This solution was soon followed by its Me- 
tropolised version ( jRossky et al. (1978)) that uses the 
Euler approximation of (|2j) to produce a proposal in the 
Metropolis-Hastings algorithm]!] by letting q (-:= 
6'(”- 1 ) + |Vlog7r(6»( n - 1 ))+/i 1 / 2 Af(0,/ dxd ). While in the 
probability community Langevin diffusions and their 
equilibrium distributions had also been around for some 
time (|Kent| (1978)), it was the Roberts and Tweedie 


(1996a I paper (motivated by Besag (19941 comment 


on 


Grenander and Miller (1994)) that brought the ap¬ 


proach to the centre of interest of the computational 
statistics community and sparked systematic study, de¬ 
velopment and applications of Metropolis adjusted Langeviiple 


and more recently Girolami and Calderhead (2011) in¬ 
troduced a mathematically-coherent approach of relat¬ 
ing cr to a metric tensor on a Riemannian manifold of 
probability distributions. The resulting algorithms are 
termed Manifold MALA (MMALA), Simplified MMALA 
(Girolami and Calderhead 2011), and position-dependent 
MALA (PMALA) ( Xifara et al.|2014| ), and differ in im¬ 
plementation cost, depending on how precise is the use 
they make of versions of equation (]3]) . The approach 
still leaves the specification of the metric to be used in 
the space of probability distributions to the user, how¬ 
ever there are some natural choices. One can, for exam- 
take the Hessian of n and replace its eigenvalues 


algorithms (hence MALA) and their cousins. 

There is a large body of empirical evidence that 
at the extra price of computing the gradient, MALA 
algorithms typically provide a substantial speed-up in 
convergence on certain types of problems. However for 
very light-tailed distributions the drift term may grow 
to infinity and cause additional instability. More pre¬ 
cisely, for distributions with sufficiently smooth con¬ 
tours, MALA is geometrically ergodic (c.f. |Roberts and 


by their absolute values A i —» |A* |. Building the metric 
tensor from this spectrally-positive version of the Hes¬ 
sian of 7r and randomising the discretisation step size h 
results in an algorithm that is as robust as random walk 
Metropolis, in the sense that it is geometrically ergodic 
for targets with tail decay of exp{—10|' 3 } for /3 > 1 (see 
Taylor (2014)). A robustified version of such a metric 


Rosenthal (2004)) if the tails of 7r decay as exp{— \6\^} 


with j3 € [1,2], while the random walk Metropolis al¬ 
gorithm is geometrically ergodic for all /3 > 1 (Roberts 


and Tweedie (1996a); Mengersen and Tweedie (1996)). 


has been introduced in Betancourt (2013) and termed 
SoftAbs. Here one approximates the absolute value of 
the eigenspectrum of the Hessian of 7r with a smooth 
strictly positive function \ t -> 

where a is a smoothing parameter. The metric sta¬ 
bilises the behaviour of both MMALA, and Hamilto¬ 
nian Monte Carlo algorithms (discussed in the sequel), 
in the neighbourhoods where the signature of the Hes¬ 
sian changes. 


The lack of geometrical ergodicity has been precisely 
quantified by Bou-Rabee and Hairer (2012). 

Various refinements and extensions have been pro¬ 
posed. These include optimal scaling and choice of the 
discretisation step h , adaptive versions (both discussed 

in Section 2.4), combinations with proximal operators 2 3 Hamiltonian Monte Carlo 
( |Pereyra 2015 Schreck et al. 2013), and applications 
and algorithm development for the infinite-dimensional 
context ( Pillai et al.|2012 Cotter et al.|2013 ). One par¬ 
ticular direction of active research is considering a more 
general version of equation 0 with state-dependent 
drift and diffusion coefficient 


As with many improvements in the literature, starting 
with the very notion of MCMC, Hamiltonian (or hy¬ 


brid) Monte Carlo (HMC) stems from Physics (Duane 


et al.|[l987 ). After a slow emergence into the statistical 


ddi 


-( 


7 i(0 t ) = E 


v{ 9 t) 


2 

da 


V log ir(6t) 
y(0t) 


7(0t) 


dt + s/a(6 t )dBt 


(3) 


community (Neal 1999), it is now central in statistical 
software like STAN (Stan Development Team 2014). 
For a complete account of this important flavour of 

(|2013), which 


MCMC, the reader is referred to Neal 


do. 


which also has 7r as invariant distribution (Xifara et al. 


( |2014| ), c.f. Kent | ( |1978[ )). The resulting proposals are 

q(-\0 (n - 1} ) := J(a(6l( n - 1 ))Vlog7r(6» (n - 1) )+7(6»( n - 1) )) 
+h 1/2 N{o,a(e < ' n - 1) )) + e {n ~ 1) . 

Choosing appropriate a for improved ergodicity is how¬ 
ever nontrivial. The idea has been explored in Stramer 
( 1999a|b ); Roberts and Stramer (2002) 


spired the description below; see also Betancourt et al.] 
|2014| for a highly mathematical differential-geometric 
approach to HMC. 

This method can be seen as a particular and effi¬ 


cient instance of auxiliary variables (see, e.g., Besag and 


Green|[l993 and Rubinstem||1981 ), in which we apply a 


and Tweedie 


deterministic-proposal Metropolis method to the aug¬ 
mented target. In physical terms, the idea behind HMC 
is to add a “kinetic energy” term to the “potential en¬ 
ergy” (negative log-target), leading to the Hamiltonian 

H(9,p) = — log7r(0) + p T M~ 1 p/2 
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where 0 denotes the object to be simulated (i.e., the pa¬ 
rameter), p its speed or momentum and M the Hamil¬ 
tonian matrix of 7 r. In more statistical language, HMC 
creates an auxiliary variable p such that moving accord¬ 
ing to Hamilton’s equations 


e 

dt 

dp 

dt 


dH _ 
dp 
dH 
~~d0 


dH 


= M 


-i„ 


dp 
d log 7T 

= dO 


preserves the joint distribution with density exp {—H{0, 
p)}, hence the marginal distribution of 0 , that is, n(0). 
Hence, if we could simulate exactly this joint distri¬ 
bution of (0,p), a sample from 7 t(9) would be a by¬ 
product. However, in practice, the equation is solved 
approximately and hence requires a Metropolis correc¬ 
tion. As discussed in, e.g., Neal (20131), the dynam¬ 


ics induced by Hamilton’s equations is reversible and 
volume-preserving in the (0,p) space, which means in 
practice that there is no need for a Jacobian in Metropo¬ 
lis updates. The practical implementation relies on a 


k —th order symplectic integrator (Hairer et al. 2006) 


most commonly on the 2-nd order leapfrog approxima¬ 
tion that relies on a small step level e, updating p and 0 
via a modified Euler’s method called the leapfrog that 
is reversible and being symplectic, preserves volume as 
well. This discretised update can be repeated for an 
arbitrary number of steps. 

When considering the implementation via a Metropo¬ 
lis algorithm, a new value of the momentum p is drawn 
from the pseudo-prior oc exp{— p T M~ 1 p/2} and it is 
followed by a Metropolis step, which proposal is driven 
by the leapfrog approximation to the Hamiltonian dy¬ 
namics on (0,p) and which acceptance is governed by 
the Metropolis acceptance probability. What makes the 
potential strength of this augmentation (or dis-integra- 
tion) scheme is that the value of H{0,p) hardly changes 
during the Metropolis move, which means that it is 
most likely to be accepted and that it may produce a 
very different value of 7 r(0) without modifying the over¬ 
all acceptance probability. In other words, moving along 
level sets is almost energy-free, but if the move proceeds 


of the time discretisation step e in the leapfrog approx¬ 
imation and of the number L of leapfrog leaps, but also 
the choice of the precision matrix M.) 

2.4 Optimal scaling and Adaptive MCMC 

The convergence of the Metropolis-Hastings algorithm^ 
depends crucially on the choice of the proposal distribu¬ 
tion q , as does the performance of both more complex 
MCMC and SMC algorithms, that often are hybrids 
using Metropolis-Hastings as simulation substeps. 

Optimising over all implementable q appears to be a 
“disaster problem” due to its infinite-dimensional char¬ 
acter, lack of clarity about what is implementable, what 
is not, and the fact that this optimal q must depend in 
a complex way on the target 7r to which we have only 
a limited access. In particular MALA provides a spe¬ 
cific approach to constructing 7r-tailored proposals and 
HMC can be viewed as a combination of Gibbs and 
special Metropolis moves for an extended target. 

In this optimisation context, it is thus reasonable to 
restrict ourselves to some parametric family of propos¬ 
als <?£, or more generally of Markov transition kernels 
Pj, where £ £ E is a tuning parameter, possibly high¬ 
dimensional. 

The aim of adaptive Markov chain Monte Carlo is 
conceptually very simple. One expects that there is a 
set E n C E of good parameters £ for which the kernel 

converges quickly to 7r, and one allows the algorithm 
to search for E^ “on the fly” and redesign the transition 
kernel during the simulation as more and more infor¬ 
mation about 7T becomes available. Thus an adaptive 
MCMC algorithm would apply the kernel P^ n ) to sam¬ 
ple 0^ given 0^ n 1 ^, where the tuning parameter 
is itself a random variable which may depend on the 
whole history 0^°\ ..., 0^ n ~ 1 '> and on Adaptive 

MCMC rests on the hope that the adaptive parame¬ 
ter will find E n , stay there essentially forever and 
inherit good convergence properties. 

There are at least two fundamental difficulties in ex¬ 
ecuting this strategy in practice. First, standard mea¬ 
sures of efficiency of Markovian kernels, like the to- 


Hastings algorithms. What drives the exploration of the 
different values of H(0,p) is therefore the simulation of 
the momentum, which makes its calibration both quite 


for “long enough”, the chain can reach far-away regions 

;al variation convergence rate (c.f. I 

deyn and Tweedie 

of the parameter space, thus avoid the myopia of stan- 

(2009) 

; Roberts and Rosenthal ( 

200 

4)), L 2 (tt) spectral 

dard MCMC algorithms. As explained in Neal (2013), gap (E 

Jiaconis and Stroock ( 

1991 

; Roberts (]1996); 

S aloft- 

this means that Hamiltonian Monte Carlo avoids the 

Coste 

(1997); Levin et al. 

(2009 

)) or asymptotic vari- 

inefficient random walk behaviour of most Metropolis- 

ance ( 

Peskun (1973); Geye 

r (1992); 

Tierney (1998 

)) in 


influential and delicate (Betancourt et al. (2014)) as it 


depends on the unknown normalising constant of the 
target. (By calibration, we mean primarily the choice 


the Markov chain central limit theorem will not be 
available explicitly, and their estimation from a Markov 
chain trajectory is often an even more challenging task 
than the underlying MCMC estimation problem itself. 

Secondly, when executing an adaptive strategy and 
trying to improve the transition kernel on the fly, the 
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Markov property of the process is violated, therefore 
standard theoretical tools do not apply, and establish¬ 
ing validity of the approach becomes significantly more 
difficult. While the approach has been successfully ap¬ 


plied in some very challenging practical problems (Solo- 


nen et al. (2012); Richardson et al. (2010); Griffin et al. 


(2014)), there are examples of seemingly reasonable adap¬ 
tive algorithms that fail to converge to the intended 


target distribution (Bai et al. (2011); Latuszyriski et al. 


(2013)), indicating that compared to standard MCMC 
even more care must be taken to ensure validity of in¬ 
ferential conclusions. 

While heuristics-based adaptive algorithms have been 


considered already in Gilks et al. (19941, a remarkable 


result providing a tool to address the difficulty of op¬ 
timising Markovian kernels coherently is the Roberts 


et al. (1997) paper on scaling the proposal variance. It 


considers settings of increasing dimensionality and in¬ 
vestigates efficiency of the random walk Metropolis al¬ 
gorithm as a function of its average acceptance rate. 
More specifically, given a sequence of targets 7r d on 
the product state space 0 d with iid components con¬ 
structed from conveniently smooth marginal /, 


Kd{9) for rf = 1,2,... (4) 

i=i 

consider a sequence of Markov chains 9d, d = 1,2,..., 
where the chain 9d = (9 ^) n= is a random walk 
Metropolis targeting 7t 'd with proposal increments dis¬ 
tributed as N(0,a^ldxd)- 

It then turns out that the only sensible scaling of 
the proposal as dimensionality increases is to take a 2 = 
l 2 d~ x . In this regime the sequence of time-rescaled first 
coordinate processes 



(LtdJ) 

d, 1 > 


for d = 1,2 ,... 


converges in a suitable sense to the solution Z of a 
stochastic differential equation 

dZ t = Hiy/ 2 dB t + \h(l)\7 log f(Z t )dt. 

Hence maximising the speed of the above diffusion h{l) 
is equivalent to maximising the efficiency of the algo¬ 
rithm as the dimension goes to infinity. Surprisingly, 
there is a one-to-one correspondence between the value 
lopt = argmax/i(7) and the mean acceptance probability 
of 0.234. 

The magic number 0.234 does not depend on / and 
gives a universal tuning recipe to be used for example 
in adaptive algorithms: choose the scale of the incre¬ 
ment so that approximately 23% of the proposals are 
accepted. 

The result, established under restrictive assumptions, 
has been empirically verified to hold much more gener¬ 
ally, for non iid targets and also in medium- and even 


low-dimensional examples with d as small as 5. It has 
been also combined with relative efficiency loss due to 
mismatch between the proposal and target covariance 
matrices (see Roberts and Rosenthal (2001)). 

The simplicity of the result and easy access to the 
average acceptance rate makes optimal scaling the main 
theoretical driver in development of adaptive MCMC 
algorithms, and adaptive MCMC is the main applica¬ 
tion and motivation for researching optimal scaling. 

A large body of theoretical work extends optimal 
scaling formally to different and more general scenarios. 
For example Metropolis for smooth non iid targets has 


been addressed e.g. by Bedard (2007), and in infinite 
dimensional settings by Beskos et al. (2009). Discrete 


and other discontinuous targets have been considered in 


Roberts (1998) and Neal et al. (2012). For MALA al¬ 


gorithms an optimal acceptance rate of 0.574 has been 
established in Roberts and Rosenthal (1998) and con¬ 


firmed in infinite-dimensional settings in Pillai et al.| 
(20121 along with the stepsize a J = l 2 d ~ x ^. Hybrid 
Monte Carlo (see Section 2.3) has been analysed in a 


similar spirit by Beskos et al. (20131 and Betancourt 


et al. (20141 concluding that any value € [0.6, 0.9] will 


be close to optimal and the leapfrog step size should be 
taken as h = l x d -1 / 4 . These results not only inform 
about optimal tuning, but also provide an efficiency or¬ 
dering on the algorithms in d— dimensions. Metropolis 
algorithms need 0{d) steps to explore the state space, 
while MALA and HMC need respectively 0(d 1 ^ 3 ) and 
©(d 1 / 4 ). 

Further extensions include studying the transient 


phase before reaching stationarity (Christensen et al. 


(20051; Jourdain et al. (2012 

201 

4)), the scaling of 

multiple-try MCMC (Bedard et al. ( 

2012 )) and delayed 

rejection MCMC (Bedard et al. 

(20141), and the tern- 

perature scale of parallel tempering type algorithms 

(Atchade et al. (2011b 1; Roberts and Rosenthal (20141). 


Interestingly, the optimal scaling of the discussed in 


Section 2.5 pseudo-marginal algorithms as obtained in 

Sherlock et al. 

(2014 

), and extended to more general 

settings in 

Doucet et al. 

(20121; 

Sherlock 

(2014 

), sug- 


gests an acceptance rate of just 0.07. 

While each of these numerous optimal scaling re¬ 
sults gives rise, at least in principle, to an adaptive 
MCMC design, the pioneering and most successful algo¬ 
rithm is the Adaptive Metropolis of Haario et al. (2001). 
With its increasing popularity in applications, this has 
fuelled the development of the field. 


Here one considers a normal increment proposal that 
estimates the target covariance matrix from past sam¬ 
ples and applies appropriate dimension-dependent scal¬ 
ing and covariance shrinkage. Precisely, the proposal 
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takes the form 

g(-|0< n_1) ) = 7V(6» (n - 1) ,C ,(n) ), (5) 

with the covariance matrix 

C {n) = ^^(co v(eW,...M n ~ 1) )+eIdx d ) (6) 

which is efficiently computed using a recursive formula. 
Versions and refinements of the adaptive Metropolis 


algorithm ( 

loberts and Rosenthal 

2009 

Andrieu and 

Thoms) 200? 

) have served well in applications and mo- 


lis, delayed rejection adaptive Metropolis (Haario et al. 


(2006[)), regional adaptation and parallel chains (Craiu 


et al.l 2009), and the robust version of Vihola (20121 


estimating the shape of the distribution rather than its 
covariance matrix and hence suitable for heavy tailed 
targets. 

Analogous development of adaptive MALA algo¬ 


rithms in Atchade (2006); Marshall and Roberts (20121 


and of adaptive Hamiltonian and Riemannian Manifold 


Monte Carlo in Wang et al. (2013) building on the adap¬ 


tive scaling theory, resulted in a similar drastic mixing 
improvement as the original Adaptive Metropolis. 

Another substantial and still unexplored area where 
adaptive algorithms are applied for very high dimen¬ 
sional and multimodal problems is model and variable 


selection (Nott andKohn (2005); Richardson et al. (2010); 


Lamnisos et al. (2013); Ji and Schmidler (2013); Grif¬ 


fin et al. (2014)). These algorithms can incorporate re¬ 


versible jump moves (Green 1995) and are guided by 


scaling limits for discrete distributions as well as tem¬ 
perature spacing of parallel tempering to address mul¬ 
timodality. Successful implementations allow for fully 
Bayesian variable selection in models with over 20 000 
variables for which otherwise only ad hoc heuristic ap¬ 
proaches have been used in the literature. 

To address the second difficulty with adaptive algo¬ 
rithms, several approaches have been developed to es¬ 
tablish their theoretical underpinning. While for stan¬ 
dard MCMC, convergence in total variation and law 
of large numbers are obtained almost trivially, and the 
effort concentrates on stronger results, like CLTs, geo¬ 
metric convergence, nonasymptotic analysis, and, maybe 
most importantly, comparison and ordering of algorithms, 
adaptive samplers are intrinsically difficult. The most 
elegant and theoretically-valid strategy is to change the 
underlying Markovian kernel at regeneration times only 


(Gilks et al. (1998)). Unfortunately, this is not very ap¬ 


pealing for practitioners since regenerations are diffi¬ 
cult to identify in more complex settings and are essen¬ 
tially unpractically rare in high dimensions. The orig¬ 


inal Adaptive Metropolis of Haario et al. (2001) has 


been validated (under some restrictive additional con¬ 
ditions) by controlling the dependencies introduced by 
the adaptation and using convergence results for mixin- 
gales. The approach has been further developed in|Atch4de 


and Rosenthal (2005) and Atchade (2006) to verify its 


ergodicity under weaker assumptions and apply the mixin- 
gale approach to adaptive MALA. Another successful 


approach (Andrieu and Moulines (2006) refined in Saks- 


man and Vihola (2010)) rests on martingale difference 


tivated much of the theoretical development. These in¬ 
clude, among many other contributions, adaptive Metropo- 


approximations and martingale limit theorems to ob¬ 
tain, under suitable technical assumptions, versions of 
LLN and CLTs. There are close links between analysing 
adaptive MCMC and stochastic approximation algo¬ 
rithms and in particular the adaptation step can be 
often written as a mean field of the stochastic approxi¬ 


mation procedure; Andrieu and Robert (2001); Atchade 


et al. (2011a); Andrieu et al. (2015) contribute to this 


direction of analysis. Fort et al. (2011) develop an ap¬ 


proach where both adaptive and interacting MCMC al¬ 
gorithms can be treated in the same framework. This al¬ 
lows addressing “external adaptation” algorithms such 
as the interacting tempering algorithm (a simplified 


version of the celebrated equi-energy sampler of Kou 


et al. ( 

2006 

) or ac 

jedow et al. 

(2013) 


We present here the rather general but fairly simple 


coupling approach (Roberts and Rosenthal (2007)) to 
establishing convergence. Successfully applied to a va¬ 
riety of adaptive Metropolis samplers under weak regu¬ 


larity conditions (Bai et al. (20111), adaptive Gibbs and 
adaptive Metropolis within adaptive Gibbs samplers 


(Latuszynski et al. (2013)), it shows that two prop¬ 
erties Diminishing Adaptation and Containment are 
sufficient to guarantee that an adaptive MCMC algo¬ 
rithm will converge asymptotically to the correct target 
distribution. To this end recall the total variation dis¬ 
tance between two measures defined as ||i/(-) — /z(-)|| := 
sup^gjr | v(A) — /t(A)|, and for every Markov transition 
kernel Pj, £ £ E and every starting point 9 € O define 
the e convergence function M e : 0 x E —> N as 


M s (0,£) := inf{n > 1 : 


|Pi n) (M- 7 r(-)ll<4- 


Let {(9^ n \ ^ n ^)}^T 0 be the corresponding adaptive MCMC 
algorithm and by A^ n ^((6, £), •) denote its marginal dis¬ 
tribution at time t, i.e. 

A (n) ((<9,£),P) :=P(6> (n) e B\9 {0) = 0,£ (o) = £)• 

The adaptive algorithm is ergodic for every starting val¬ 
ues of 9 and £ if lim^oo ||A(")(($, £, •) — 7r(-)|| =0. The 
two conditions guaranteeing ergodicity are 

Definition 1 (Diminishing Adaptation) The adap¬ 
tive algorithm with starting values 9 = 9 and = £ 
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satisfies Diminishing Adaptation, if 
lim D {n) = 0 in probability, where 

n—>oo 

D (n) := sup ||P C („ + i)(0,-)-P € c») (<?,■) II- 

see 

Definition 2 (Containment) The adaptive algorithm 
with starting values 9^ = 9 and £(°) = £ satisfies Con¬ 
tainment, if for all e > 0 the sequence {M e (9^ n \^ n ^) }J£L 0 
is bounded in probability. 


While diminishing adaptation is a standard require¬ 
ment, Containment is subject to some discussion. On 
one hand, it may seem difficult to verify in practice; on 
the other, it may appear restrictive in the context of er- 
godicity results under some weaker conditions (c.f. 


Fort 


et al. (2011)). However, it turns out (Latuszyriski and 


Rosenthal (20141) that if Containment is not satisfied, 


then the algorithm may still converge, but with posi¬ 
tive probability it will be asymptotically less efficient 
than any nonadaptive ergodic MCMC scheme. Hence 
algorithms that do not satisfy Containment are termed 
AdapFail and are best avoided. Containment has been 
further studied in Bai et al. (2011) and is in particular 
implied by simultaneous geometric or polynomial drift 
conditions of the adaptive kernels. 

Given that adaptive algorithms may be incorpo¬ 
rated in essentially any sampling scheme, their intro¬ 
duction seems to be one of the most important inno¬ 
vations of the last two decades. However, despite sub¬ 
stantial effort and many ingenious contributions, the 
theory of adaptive MCMC lags behind practice even 
more than may be the case in other computational ar¬ 
eas. While theory always matters, the numerous unex¬ 
pected and counterintuitive examples of transient adap¬ 
tive algorithms suggest that in this area theory matters 
even more for healthy development. 

For adaptive MCMC to become a routine tool, a 
clear-cut result is needed saying that under some eas¬ 
ily verifiable conditions these algorithms are valid and 
perform not much worse than their nonadaptive coun¬ 
terpart with fixed parameters. Such a result is yet to be 
established and may require deeper understanding of 
how to construct stable adaptive MCMC, rather than 
aiming heavy technical artillery at algorithms currently 
in use without modifying them. 


2.5 Estimated likelihoods and pseudo-marginals 

There are numerous settings of interest where the tar¬ 
get density 7r(j y) is not available in closed form. For 
instance, in latent variable models, the likelihood func¬ 
tion £{9\y) is often only available as an intractable in¬ 


tegral 


£{Q\y) = J^g(z,y\ 9 ) dz, 


which leads to 


tt(%) oc t r(0) / g(z,y\9)dz 


being equally intractable. A solution proposed from the 


early days of MCMC (Tanner and Wong 1987) is to con¬ 


sider z as an auxiliary variable and to simulate the joint 
distribution w(9, z\y ) on 0 x Z by a standard method, 
leading to simulating the marginal density 7r(-| y) as a 
by-product. However, when the dimension of the aux¬ 
iliary variable z grows with the sample size, this tech¬ 
nique may run into difficulties as induced MCMC al¬ 
gorithms are more and more likely to have convergence 
issues. An illustration of this case is provided by hid¬ 
den Markov models, which have eventually to resort to 
particle filters as Markov chain algorithms become in¬ 


effective (Chopin 2007 Fearnhead and Clifford 2003). 


Another situation where the target density 7r(-|y) can¬ 
not be directly computed is the case of the “doubly 
intractable” likelihood (Murr ay et al.|20 06a), when the 
likelihood function £{9\y) oc g(y\9) itself contains a term 
that is intractable, in that it makes the normalising con¬ 
stant 


3(6*)= f g(y\0)dy 
Jy 


iy 

impossible to compute. The resulting posterior writes 

7r(%(2/|0) 


r (%) = 


where 


-d0. 


3 (9)p(y) 
i r(9)g(y\ 6) 

ie m 

and consequently the Metropolis-Hastings acceptance 
rate becomes 


p(y) = / 

Jo 




AOMv.ewm 3(0') 

and cannot be evaluated exactly for algorithmic pur¬ 
poses. 

Examples of this kind abound in Markov random 


fields models, as for instance for the fsing model (Mur- 
ray et al.]|2006b Mpller et al.||2006). 


Both the approaches of Murray et al. (2006a) and 


Mpller et al. (2006) require sampling data from the like¬ 


lihood £(6\y), which limits their applicability. The latter 
uses in addition an importance sampling function and 
may suffer from poor acceptance rates. 


Andrieu and Roberts (2009) propose a more general 


resolution of such problems by designing a Metropolis- 
Hastings algorithm that replaces the intractable tar¬ 
get density 7r(-| y) with an unbiased estimator, following 
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an idea of Beaumont (2003). The approach is termed even not be ergodic (Medina-Aguayo et al. 2015). If 


pseudo-marginal. Rather than evaluating the posterior 
exactly, a positive unbiased estimate Sg of 7r(f%) is 
utilised. More formally, a new Markov chain is con¬ 
structed that evolves on the extended state space 0 x 
M+, where at iteration n, given the pair (0(" _1 ), ) 

of parameter value and density estimate at this value, 
the proposal (9 1 , Sg> ) is obtained by sampling 9' ~ q{-\9 v 
and obtaining Sg>, the estimate of n(9'\y). Analogously 
to the standard Metropolis-Hastings step, the pair ((?', Sg >) 
is accepted as {9^ n \ S 1 ^) with probability 

\ 1 Sg,q{8^\9') 1 


ergodic, the difference between stationary distribution, 
resulting from the noisy acceptance must be quantified, 
which is a highly nontrivial task and the bounds will 


rarely be tight (see also Alquier et al. (2014); Pillai and 


I ' s£-_\\q(6'\90-V) j ’ 

otherwise the proposal is rejected and the new value set 


as 


(i 9 {n) , S&1 ) := {9^ , SfclV)). 


It is not difficult to verify that the bivariate chain on 
extended state space 0 x R + enjoys the correct n(9\y) 


marginal on 0 and the approach is valid, see Andrieu 


and Roberts (2009)) for details (and also Andrieu and 


|Smith| p014[ ); Rudolf and Schweizer (2015) for related 
methodology and theory). The approach is however an 
^(Interesting avenue since at the price of being biased, 
it overcomes mixing difficulties of the exact pseudo¬ 
marginal version. 

Design and understanding of pseudo-marginal algo¬ 
rithms is a direction of dynamic methodological devel¬ 
opment that in the coming years will be further fu¬ 
elled not only by complex models with intractable like¬ 
lihoods, but also by the need of MCMC algorithms for 
Big Data. In this context the likelihood function cannot 
be evaluated for the whole dataset even in the iid case 
just because computing the long product of individual 
likelihoods is infeasible. Several Big Data MCMC ap- 


Vihola (2015) for an abstracted account). 


and Vihola (2015) where the efficiency of the algorithm 


is studied in terms of its spectral gap and CLT asymp- 


totic variance 

Sherlock et al. 

rH 

o 

CN 

Doucet et al. ( 

2012 ) 

and 

Sherlock 

to 

o 

I — 1 

), on the other hand, investigate the 


efficiency as a function of the acceptance rate and vari¬ 
ance of the noise, deriving the optimal scaling, as dis¬ 
cussed in Section [2741 

As an alternative to the above procedure of using 
estimates of the intractable likelihood to design a new 
Markov chain on an extended state space with correct 
marginal, one could naively use these estimates to ap¬ 
proximate the Metropolis-Hastings accept-reject ratio 
and let the Markov chain evolve in the original state 
space. This would amount to dropping the current re¬ 
alisation of Sg and obtaining a new one in each accept- 
reject attempt. Such a procedure is termed Monte Carlo 
within Metropolis (Andrieu and Ro berts|[20 09). Unfor¬ 
tunately this approach does not preserve the station¬ 
ary distribution, and the resulting Markov chain may 


proaches have been already considered in Welling and 


One specific instance of constructing unbiased esti¬ 
mators of the posterior is presented in |Girolami et al.| 
(2013) and based on random truncations of infinite se¬ 
ries expansions. The paper also offers an excellent overview 
of inference methods for intractable likelihoods. 

The performance of the pseudo-marginal approach 
will depend on the quality of the estimators Sg and 
hence stabilising them as well as understanding this re¬ 
lationship is an active area of current development. Of¬ 
ten Sg is constructed as an importance sampler based 
on an importance sample z. Thus in particular, the im¬ 
provements from using multiple samples of z to esti¬ 
mate 7r are of interest and can be assessed from lAndrieuI 


Teh ( 

2011 ); 

Korattikara et al. 

(2013) 

Teh et al. 1 

2014); 

Bardenet et al. (2014J); Maclaurin and Adams 1 

2014); 

Minsker et al. (2014); Quiroz et al. (20141; Strathmann 

et al. 

(2015). 





2.6 Particle MCMC 

While we refrain from covering particle filters here, since 
others ( Beskos et al.j 2015) in this volume are focussing 
on this technique, a recent advance at the interface be¬ 
tween MCMC, pseudo-marginals, and particle filtering 
is the notion of particle MCMC (or pMCMC), devel¬ 
oped by Andrieu et al. (2011). This innovation is in¬ 
deed rather similar to the pseudo-marginal algorithm 
approach, taking advantage of the state-space models 
and auxiliary variables used by particle filters. It differs 
from standard particle filters in that it targets (mostly) 
the marginal posterior distribution of the parameters. 

The simplest setting in which pMCMC applies is one 
of a state-space model where a latent sequence xq-.t is 
a Markov chain with joint density 

p 0 (x 0 \9)p 1 (xi\x 0 ,9)) ■ ■ ■ p T (x T \x T -i, 0 ), 

and is associated with an observed sequence y\ : r such 
that 

T 

yi:T\xi:T, 0 ~ qi{yi\xi, 9). 

i=1 

The iterations of pMCMC are MCMC-like in that, at 
iteration t, a new value 9' of 9 is proposed from an arbi¬ 
trary transition kernel b(-|#W) and then a new value of 
the latent series x' 0 . T is generated from a particle filter 
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approximation of p{xq-.t\0' , Ui-.t)- Since the particle fil¬ 
ter returns as a by-product ( Del Moral et al.||2006 ) an 
unbiased estimator of the marginal posterior of Ui-.t, 
u{Ut.tW), this estimator can be used as such in the 
Metropolis- Hastings ratio 


9(yi:T|6 l )7r(6 ,(t) )h(6 ,, |d (t) ) 


A 1. 


Its validity follows from the general argument of Am] 


drieu and Roberts (2009), although some additional 


(notational) effort is needed to demonstrate all random 
variables used therein are correctly assessed (see An- 


| drieu et al. 2011 and Wilkinson |2011a the latter pro¬ 
viding a very progressive introduction to the notions of 
pMCMC and particle Gibbs, which helped greatly in 
composing this section). Note however that the general 
validation of pMCMC as targetting the joint posterior 
of the states and parameters and of the parallel parti¬ 
cle Gibbs sampler does not follow from pseudo-marginal 
arguments. 

This approach is being used increasingly in com¬ 
plex dynamic models like those found in signal process¬ 


ing (Whiteley et al. 2010), dynamical systems like the 


PDEs in biochemical kinetics (Wilkinson 2011b) and 


probabilistic graphical models (Lindsten et al. 2014). 


An extension to approximating the sequential filtering 


distribution is found in Chopin et al. (2013). 


2.7 Parallel MCMC 

Since MCMC relies on local updating based on the 
current value of a Markov chain, opportunities for ex¬ 
ploiting parallel resources, either CPU or GPU, would 
seem quite limited, In fact, the possibilities reach far 
beyond the basic notion of running independent or cou¬ 
pled MCMC chains on several processors. For instance, 


Craiu and Meng (2005) construct parallel antithetic 


coupling to create negatively correlated MCMC chains 
(see also Frigessi et al.| 20001, while Craiu et al. (20091 
use parallel exploration of the sample space to tune 


an adaptive MCMC algorithm. Jacob et al. (2011) ex¬ 


ploit GPU facilities to improve by Rao-Blackwellisation 
the Monte Carlo approximations produced by a Markov 
chain, even though the parallelisation does not improve 


the convergence of the chain. See also Lee et al. (2009) 


and Suchard et al. (2010) for more detailed contribu¬ 


tions on the appeal of using GPUs towards massive par¬ 


allelisation, and Wilkinson (2005) for a general survey 
on the topic. 

Another recently-explored direction is “prefetching”. 


Based on Brockwell (2006) this approach computes the 
2 2 , 2 3 ,..., 2 k values of the posterior that will be needed 


2 ,3,..., k sweeps ahead by simulating the possible “fu¬ 
tures” of the Markov chain, according to whether the 
next k proposals are accepted or not, in parallel. Run¬ 
ning a regular Metropolis-Hastings algorithm then means 
building a decision tree back to the current iteration 
and drawing 2,3,... ,k uniform variates to go down 
the tree to the appropriate branch. As noted by [Brock- 
well (20061, “in the case where one can guess whether 


or not acceptance probabilities will be ‘high’ or ‘low’, 
the tree could be made deeper down ‘high’ probabil¬ 
ity paths and shallower in the ‘low’ probability paths.” 


This idea is exploited in Angelino et al. (2014), by creat 


ing “speculative moves” that consider the reject branch 
of the prefetching tree more often than not, based on 
some preliminary or dynamic evaluation of the accep¬ 
tance rate. Using a fast but close-enough approxima¬ 
tion to the true target (and a fixed sequence of uni¬ 
forms) may also produce a “single most likely path” 
on which prefetched simulations can be run. The ba¬ 
sic idea is thus to run simulations and costly likeli¬ 
hood computations on many parallel processors along 
a prefetched path, a path that has been prefetched for 
its high approximate likelihood. There are obviously in¬ 
stances where this speculative simulation is not helpful 
because the actual chain with the genuine target ends 


up following another path. Angelino et al. (2014) ac¬ 


tually go further by constructing sequences of approx¬ 
imations for the precomputations. The proposition for 
the sequence found therein is to subsample the original 
data and use a normal approximation to the difference 


of the log (sub-)likelihoods. See Strid (2010) for related 
ideas. 

A different use of parallel capabilities is found in 


Calderhead (2014). At each iteration of Calderhead’s 


algorithm, N replicas are generated, rather than 1 in 
traditional Metropolis-Hastings. The Markov chain ac¬ 
tually consists of N components, from which one com¬ 
ponent is selected at random as a seed for the next 
proposal. This approach can be seen as a special type 
of data augmentation (Tanner and Wong 19871, where 
the index of the selected component is an auxiliary vari¬ 
able. The neat trick in the proposal (and the reason 
for its efficiency gain) is that the stationary distribu¬ 
tion of the auxiliary variable can be determined and 
hence used N times in updating the vector of N com¬ 
ponents. An interesting feature of this approach is when 
the original Metropolis-Hastings algorithm is expressed 
as a finite state space Markov chain on the set of in¬ 
dices {1, ...,1V'}. Conditional on the values of the N 
dimensional vector, the stationary distribution of that 
sub-chain is no longer uniform. Hence, picking N in¬ 
dices from the stationary helps in selecting the most 
appropriate images, which explains why the rejection 
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rate decreases. The paper indeed evaluates the impact 
of increasing the number of proposals in terms of ef¬ 
fective sample size (ESS), acceptance rate, and mean 
squared jump distance. Since this proposal is an almost 
free bonus resulting from using N processors, it sounds 
worth investigating and comparing with more complex 
parallel schemes. 


Neiswanger et al. (20131 introduced the notion of 


embarrassingly parallel MCMC, where “embarrassing” 
refers to the “embarrassingly simple” solution proposed 
therein, namely to solve the difficulty in handling very 
large datasets by running completely independent par¬ 
allel MCMC samplers on parallel threads or comput¬ 
ers and using the outcomes of those samplers as den¬ 
sity estimates, pulled together as a product towards an 
approximation of the true posterior density. In other 
words, the idea is to break the posterior as 


p(9\y) IIM%) 


and to use the estimate 


(7) 


p(%) « Umv) 


i =1 

where the individual estimates are obtained, say, non- 
parametrically. The method is then “asymptotically ex¬ 
act” in the weak (and unsurprising) sense of converg¬ 
ing in the number of MCMC iterations. Still, there is 
a theoretical justification that is not found in previous 
parallel methods that mixed all resulting samples with¬ 
out accounting for the subsampling. And the point is 
made that, in many cases, running MCMC samplers 
with subsamples produces faster convergence. The de¬ 
composition of p(-) into its components is done by par¬ 
titioning the iid data into M subsets and taking a power 
1 /to of the prior in each case. (This may induce issues 
about impropriety.) However, the subdivision is arbi¬ 
trary and can thus be implemented in cases other than 
the fairly restrictive iid setting. Because each (subsam¬ 
ple) nonparametric estimate involves T terms, the re¬ 
sulting overall estimate contains Tm terms and the au¬ 
thors suggest using an independent Metropolis sampler 
to handle this complexity. This is in fact necessary for 
producing a final sample from the (approximate) true 
posterior distribution. 


In a closely related way, Wang and Dunson (2013) 


start from the same product representation of the tar¬ 
get (posterior), namely, (f7|. Howe ver, they criticise the 
choice made by Neiswanger et al. (2013) to use MCMC 


approximations for each component of the product for 
the following reasons: 

1. Curse of dimensionality in the number of parameters 

d; 


2. Curse of dimensionality in the number of subsets to; 

3. Tail degeneration; 

4. Support inconsistency and mode misspecification. 

Point 1 is relevant, but there may be ways other than 
kernel estimation to mix samples from the terms in the 
product. Point 2 is less of a clearcut drawback: while 
the Tm terms corresponding to a product of to sums of 


T terms sounds self-defeating, Neiswanger et al. (2013) 


use a clever device to avoid the combinatorial explo¬ 
sion, namely operating on one component at a time. 
Having non-manageable targets is not such an issue 
in the post-MCMC era. Point 3 is formally correct, in 
that the kernel tail behaviour induces the kernel esti¬ 
mate tail behaviour, most likely disconnected from the 
true target tail behaviour, but this feature is true for 
any non-parametric estimate, even for the Weierstrass 
transform defined below, and hence maybe not so rel¬ 
evant in practice. In fact, by lifting the tails up, the 
simulation from the subposteriors should help in visit¬ 
ing the tails of the true target. Finally, point 4 does not 
seem to be life-threatening. Assuming that the true tar¬ 
get can be computed up to a normalising constant, the 
value of the target for every simulated parameter could 
be computed, eliminating those outside the support of 
the product and highlighting modal regions. 

The Weierstrass transform of a density / is a con¬ 
volution of / and of an arbitrary kernel K. |Wang and 


Dunson (2013) propose to simulate from the product of 


the Weierstrass transform, using a multi-tiered Gibbs 
sampler. Hence, the parameter is only simulated once 
and from a controlled kernel, while the random effects 
from the convolution are related with each subposte¬ 
rior. While the method requires coordination between 
the parallel threads, the components of the target are 
separately computed on a single thread. The clearest 
perspective on the Weierstrass transform may possi¬ 
bly be the rejection sampling version where simulations 
from the subpriors are merged together into a normal 
proposal on 6 , to be accepted with a probability de¬ 
pending on the subprior simulations. 


VanDerwerken and Schmidler (2013) keep with the 


spirit of parallel MCMC papers like consensus Bayes 


son 


(Scott et al. 2013), embarrassingly parallel MCMC (Nei: 


et al. 2013) and Weierstrass MCMC (Wang and Dun 


wanger 


2013), namely that the computation of the likeli¬ 


hood can be broken into batches and MCMC run over 
those batches independently. The idea of the authors is 
to replace an exploration of the whole space operated 
via a single Markov chain (or by parallel chains act¬ 
ing independently which all have to “converge”) with 
parallel and independent explorations of parts of the 
space by separate Markov chains. The motivation is 
that “Small is beautiful”: it takes a shorter while to 
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explore each set of the partition, hence to converge, 
and, more importantly, each chain can work in paral¬ 
lel with the others. More specifically, given a partition 
of the space, into sets Aj with posterior weights w t . 
parallel chains are associated with targets equal to the 
original target restricted to those AjS. This is there¬ 
fore an MCMC version of partitioned sampling. With 
regard to the shortcomings listed in the quote above, 
the authors consider that there does not need to be a 
bijection between the partition sets and the chains, in 
that a chain can move across partitions and thus con¬ 
tribute to several integral evaluations simultaneously. 
It is somewhat unclear (a) whether or not this im¬ 
pacts ergodicity (it all depends on the way the chain 
is constructed, i.e., against which target) as it could 
lead to an over-representation of some boundary regions 
and (b) whether or not it improves the overall conver¬ 
gence properties of the chain(s). A more delicate issue 
with the partitioned MCMC approach stands with the 
partitioning. Indeed, in a complex and high-dimension 
model, the construction of the appropriate partition is 
a challenge in itself as we often have no prior idea where 
the modal areas are. Waiting for a correct exploration 
of the modes is indeed faster than waiting for crossing 
between modes, provided all modes are represented and 
the chain for each partition set Ai has enough energy 
to explore this set. It actually sounds unlikely that a 
target with huge gaps between modes will see a consid¬ 
erable improvement from the partioned version when 
the partition sets A, are selected on the go, because 
some of the boundaries between the partition sets may 
be hard to reach with an off-the-shelf proposal. A last 
comment about this innovative paper is that the adap- 


it may be a central feature of computational Bayesian 
statistics in the coming years. From a statistical per¬ 
spective, it also induces a somewhat paradoxical situa¬ 
tion where loss of information is balanced by improve¬ 
ment in precision, for a given computational budget. 
This perspective is not only interesting at the computa¬ 
tional level but forces us (as statisticians) to re-evaluate 
in depth the nature of a statistical model and could 
produce a paradigm shift in the near future by giving 
a brand new meaning to George Box’s motto that “all 
models are wrong”. 

3.1 ABC per se 

It seems important to discuss ABC (Approximate Bayesian 
computation) in this partial tour of Bayesian computa¬ 
tional techniques as (a) they provide the only approach 
to their model for some Bayesians, (b) they deliver 
samples in the parameter space that are exact simu¬ 
lations from a posterior of some kind (Wilkinson 2013), 


7r ABC (0|yo) if not the original posterior 7r(0|yo), where 
yo denotes the data in this section (c) they may be more 
intuitive to some researchers outside statistics, as they 
entail simulating from the inferred model, i.e., going 
forward from parameter to data, rather than backward, 
from data to parameter, as in traditional Bayesian infer¬ 
ence, (d) they can be merged with MCMC algorithms, 
and (e) they allow drawing inference directly from sum¬ 
maries of the data rather than the data itself. 

ABC techniques play a role in the 2000s that MCMC 
methods did in the 1990s, in that they handle new mod¬ 
els for which earlier (e.g., MCMC) algorithms were at 
a loss, in the same way the latter (MCMC) were able 


with Wang-Landau schemes (Wang and Landau 2001 

to handle models that regular Monte Carlo approaches 
could not reach, such as latent variable models (Tanner 

Lee et al.|2005 Atchade and Liu 2010 Jacob and Ryder 

and Wong|1987 

Diebolt and Robert| 1994 Richardson 

2014). 

and Green [1997 

). New models for which ABC unlocked 


3 ABC and others, exactly delivering an 
approximation 

Motivated by highly complex models where MCMC 
algorithms and other Monte Carlo methods were too 
inefficient by far, approximate methods have emerged 
where the output cannot be considered as simulations 
from the genuine posterior, even under idealised situ¬ 
ations of infinite computing power. These methods in¬ 
clude ABC techniques, described in more details below, 
but also variational Bayes ( Jaakkola and Jordan|20ti0 ), 
empirical likelihood (Owen 2001), INLA (Rue et al. 


2009) and other solutions that rely on pseudo-models, 


or on summarised versions of the data, or both. It is 
quite important to signal this evolution as we think that 


the gate include Markov random fields, Kingman’s co- 
alescent for phylogeographical data, likelihood models 
with an intractable normalising constant, and models 
defined by their quantile function or their characteris¬ 
tic function. While the ABC approach first appeared a 
“quick-and-dirty” solution, to be considered only until 
more elaborate representations could be found, those al¬ 
gorithms have been progressively incorporated into the 
statistician’s toolbox as a novel form of generic non- 
parametric inference handling partly-defined statistical 
models. They are therefore attractive as much for this 
reason as for being handy computational solutions when 
everything else fails. 

A statistically intriguing feature of those methods is 
that they customarily require—for greater efficiency— 
replacing the data with (much) smaller-dimension sum- 
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marieiQor summary statistics, because of the complex¬ 
ity of the former. In almost every case calling for ABC, 
those summaries are not sufficient statistics and the 
method thus implies from the start a loss of statistical 
information, at least at a formal level, since relying on 
the raw data is out of the question and therefore the ad¬ 
ditional information it provides is moot. This imposed 
reduction of the statistical information raises many rel¬ 
evant questions, from the choice of summary statistics 
( |Blum et al. 20131 to the consistency of the ensuing 
inference ( Robert et al.pOlT ). 

Although it has now diffused into a wide range of 
applications, the technique of Approximate Bayesian 
Computation (ABC) was first introduced by and for 
population genetics (Tavare et al.|1997 Pritchard et al. 


1999) to handle ancestry models driven by Kingman’s 


coalescent and with strictly intractable likelihoods (Beau¬ 


mont 


2010). The likelihood function of such genetic 
models is indeed “intractable” in the sense that, while 
derived from a fully defined and parameterised proba¬ 
bility model, this function cannot be computed (at all or 
within a manageable time) for a single value of the pa¬ 
rameter and for the given data. Bypassing the original 
example to avoid getting mired into the details of pop¬ 
ulation genetics, examples of intractable likelihoods in¬ 
clude densities with intractable normalising constants, 
i.e., f(y\9) — g(y\9)/Z(9) such as in Potts ( Potts|1952 | 
and auto-exponential (Besag 19721 models, and pseudo¬ 
likelihood models (Cucala et al.||2009). 


Example 1 A very simple illustration of an intractable 
likelihood is provided by Bayesian inference based on 
the median and median absolute deviation statistics 
of a sample from an arbitrary location-scale family, 

Ui> ■ ■ ■ iUn ~ <7~ 1 g(<7~ 1 {?/ — m})i as the joint distribu¬ 
tion of this statistic is not available in closed form. ◄ 


The concept at the core of ABC methods can be 
seen as both very naive and intrinsically related to the 
foundations of Bayesian statistics as inverse probabil¬ 
ity (Rubin 19841. This concept is that data y simu¬ 
lated conditional on values of the parameter close to the 
“true” value of the parameter should look more similar 
to the actual data yo than data y simulated conditional 
on values of the parameter far from the “true” value. 
ABC actually involves an acceptance/rejection step in 


1 Maybe due to their initial introduction in population ge¬ 
netics, the oxymoron ‘summary statistics’ is now prevalent 
in descriptions of ABC algorithms, included in the statistical 
literature, where the (linguistically sufficient) term ‘statistic’ 
would suffice. 


that parameters simulated from the prior are accepted 
only when 


p(y,yo) < 


where p(-, •) is a distance and e > 0 is called the toler¬ 
ance. It can be shown that the algorithm exactly sam¬ 
ples the posterior when e = 0, but this is very rarely 
achievable in practice (Grelaud et al. 2009). An algo¬ 
rithmic representation is as follows: 


Algorithm 2 ABC (basic version) 
for t = 1 to N do 

repeat 

Generate 9* from the prior 7r(-) 
Generate y* from the model f(-\9*) 
Compute the distance p(y°,y*) 
Accept 0* if p(y°,y*) < e 
until acceptance 
end for 

return N accepted values of 9* 


Calibration of the ABC method in Algorithm [2] in¬ 
volves selecting the distance p{-, ■) and deducing the tol¬ 
erance from computational cost constraints. However, 
in realistic settings, ABC is never implemented as such 
because comparing raw data to simulated raw data is 


rarely efficient, noise dominating signal (see, e.g., Marin 


et al. (2011) for toy examples). It is therefore natural 


that one first considers dimension-reduction techniques 
to bypass this curse of dimensionality. For instance, 
if rudimentary estimates S(y) of the parameter 9 are 
available, they are good candidates. In the ABC liter¬ 
ature, they are called summary statistics, a term that 
does not impose any constraint on their form and hence 
leaves open the question of performance, as discussed in 
|Marin et al. (2011); Blum et al. (2013). A more practi¬ 
cal version of the ABC algorithm is shown in Algorithm 
[3] below, with a different output for each choice of the 
summary statistic. We stress in this version of the al¬ 
gorithm the construction of the tolerance e as a quan¬ 
tile of the simulated distances p{S(y°), S(y^ t:> )), rather 
than an additional parameter of the method. 


Algorithm 3 ABC (version with summary) 

for t = 1 to N re f do 

Generate 9 < - t ' ) from the prior 7r(-) 

Generate y (t ) from the model /(jfb 41 ) 

Compute d t = p(S(y°),S(y (t) )) 

end for 

Order distances d(i) < d( 2 ) < ... < d^ Nrsf j 
return the values associated with the k smallest dis¬ 
tances 
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An immediate question about this approximate al¬ 
gorithm is how much it remains connected with the 
original posterior distribution and in case it does not, 
where does it draw its legitimacy. A first remark in this 
connection is that it constitutes at best a convergent 
approximation to the posterior distribution n(9\S(yo)). 
It can easily be seen that ABC generates outcomes from 
a genuine posterior distribution when the data is ran¬ 
domised with scale e (Wilkinson 2013 Fearnhead and 


Prangle||2012 ). This interpretation indicates a decrease 


in the precision of the inference but it does not provide 
a universal validation of the method. A second perspec¬ 
tive on the ABC output is that it is based on a non- 
parametric approximation of the sampling distribution 


a purely Bayesian nonparametric analysis of this aspect 
has not yet emerged, this brings an additional if cau¬ 
tious support for the method. 

Example 2 Continuing from the previous example of a 
location-scale sample only monitored through the pair 
median plus mad statistic, we consider the special case 
of a normal sample yi,...,y n ~ N(p,a 2 ), with n = 
100. Using a conjugate prior p ~ A/ r (0,10), cr~ 2 ~ 
Ga( 2, 5), we generated 10 6 parameter values, along with 
the corresponding pairs of summary statistics. When 
creating the distance p(-, •), we used both following ver¬ 
sions: 


Pl(S(y°),S(y)) = l med (y°) —med (y l/mad(med( Y)) 

_|_ |mad(y°)-mad(y|/ mad ( mad (Y)) 
P2(S(y°),S(y)) = l med (y°)- med (yl/mad(med(Y)) + 


| logmad(y°)-logmad(y|/ mad ( logmad (Y)) 
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2014 


0 10 20 


1.5 2.0 2.5 


1 


1.5 2.0 2.5 3.0 3.5 


Fig. 1 Comparison of the posterior distributions on p (left) 
and cr (right) when using an ABC algorithm [5] with distance 
pi (top) and p 2 (central), and when using a standard Gibbs 
sampler (bottom). All three samples are based on the same 
number of subsampled parameters. The dataset is a Af( 3, 2 2 ) 
sample and the tolerance value e corresponds to a = .5% of 
the reference table. 
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where the denominators are computed from the refer¬ 
ence table in order to scale the components properly. 
Figure [T] shows the impact of the choice of this dis¬ 
tance, but even more clearly the discrepancy between 
inference based on the ABC and the true inference on 
(/qer 2 ). 

The discrepancy can however be completely elimi¬ 
nated by post-processing: Figure [2] reproduces Figure [l] 
by comparing the histograms of an ABC sample with 
the version corrected by Beaumont et al.’s (j2002j) local 
regression, as the latter is essentially equivalent to a 
regular Gibbs output. ◄ 



Fig. 2 Comparison of the posterior distributions on p (left) 
and cr (right) when using an ABC algorithm [ 3 ] with distance 
pi (top), a post-processed version by Beaumont et al. ’s J2002) 
local regression (central), and when using a standard Gibbs 
sampler (bottom). The simulation setting is the same as in 
Figure [T] 


Barber et al. (2015) studies the rate of convergence 


for ABC algorithms through the mean square error 
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when approximating a posterior moment. They show 
the convergence rate is of order 0(n 2//<!+4 ), when q is 
the dimension of the ABC summary statistic, associ¬ 
ated with an optimal tolerance in O(n -1 / 4 ). Those rates 
are connected with the nonparametric nature of ABC, 
as already suggested in the earlier literature: for in¬ 
stance, Blum (2010), who links ABC with standard 


kernel density non-parametric estimation and find a tol¬ 
erance (re-expressed as a bandwidth) of order n~^ q+i 


and an rmse of order 2 /q+4 as well, while Fearnhead and 


Prangle (20121 obtain similar rates, with a tolerance of 
order n~ 1/,q+2 for noisy ABC. See also 


Cal vet and Czel- 


lar (20141. Similarly, Biau et al. (2014) obtain precise 


convergence rates for ABC interpreted as a fc-nearest- 
neighbour estimator. 


Lee and Latuszynski (2014) have also produced pre¬ 


cise characterisations of the geometric ergodicity or lack 
thereof of four ABC-MCMC algorithms: 

1 . the standard ABC-MCMC (with N replicates of the 
simulated pseudo-data to each simulated parameter 
value), 

2 . versions involving simulations of the replicates re¬ 
peated at the subsequent step, 

3. use of a stopping rule in the generation of the pseudo 
data, and 

4. a “gold-standard algorithm based on the (unavail¬ 
able) measure of an e ball around the data. 

Based a result by Roberts and Tweedie (1996b), also 
used in Mengersen and Tweedie (1996), namely that 


an MCMC chain cannot be geometrically ergodic when 
there exist almost-absorbing states, they derive that 
(under some technical assumptions) the first two ver¬ 
sions above cannot be variance-bounding (i.e., that the 
spectral gap is zero), while the last two versions can be 
both variance-bounding and geometrically ergodic un¬ 
der some appropriate conditions on the prior and the 
above ball measure. This result is thus rather striking 
in simulating a random number of auxiliary variables 
is sufficient to produce geometric ergodicity. We note 
that this result does not contradict the parallel result 
of Bornn et al. (2014), who establish that there is no 
efficiency gain in simulating IV > 1 replicates of the 
pseudo-data, since there is no randomness involved in 
that approach. However, the latter result only applies 
to functions with finite variances. 

When testing hypotheses and selecting models, the 
Bayesian approach relies on modelling hypotheses and 
model indices as part of the parameter and hence ABC 
naturally operates as this level as well, as demonstrated 
in Algorithm [4] following Cornuet et al. (2008), Gre- 


laud et al. (2009) and Toni et al. (2009). In fields like 


population genetics, model choice and hypotheses val¬ 
idation is presumably the primary motivation for us- 


ing ABC methods as exemplified in Belle et al. (2008); 

Cornuet et al. (2010); Excoffier et al. 

(2009); Ghirotto 

et al. (20L 

)); Guillemaud et al. ( 

2009); 

Leuenberger and 

Wegmann 

(2010); 

Patin et al. 

(2009); Ramakrishnan 

and Hadly 

(2009); 

Verdu et al. 

(2009); Wegmann and 

Excoffier ( 

2010). It is also the area that attracts most 


of the criticisms addressed against ABC: while some 


are easily dismissed (see, e.g., 

Templeton 

2008 

2010 

Beaumont et al. 

2010; Berger et al. 

2010 ), 

the impact 


of the choice of the summary statistics on the value of 
the posterior probability remains a delicate issue that 


prompted Pudlo et al. (2014) to advocate the alterna¬ 


tive use of a posterior predictive error. 


Algorithm 4 ABC (model choice) 
for i = 1 to N do 

Generate fflt from the prior tt(A4 = rn) 

Generate 8<jyi from the prior 
Generate y from the model f<m{y\9wi) 

Compute the distance p{S(y), S'(yo)} 

Set OrtW = an and 0™ = 6 m 
end for 

return the values 3JtW associated with the k smallest 
distances 


Indeed, Robert et al. (2011) pointed out the po¬ 
tential irrelevance of ABC-based posterior probabili¬ 
ties, due to the possible ancilarity (for model choice) of 


summary statistics, as also explained in Didelot et al. 


(2011). Marin et al. (2014) consider for instance the 
comparison of normal and Laplace fits on both nor¬ 
mal and Laplace samples and show that using sample 
mean and sample variance as summary statistics pro¬ 
duces Bayes factors converging to values near 1, instead 
of the consistent 0 and +oo. 


Marin et al. (2014) analyses this phenomenon with 


the aim of producing a necessary and sufficient consis¬ 
tency condition on summary statistics. Quite naturally, 
the summaries that are acceptable must display differ¬ 
ent behaviour under both models, in the guise of ranges 
of means E e [5'(y 0 )] that do not intersect for the two 
models. (In the counter-example of the normal-Laplace 
test, the expectations of the sample mean and variance 
can be recovered under both models.) This character¬ 
isation then leads to a practical asymptotic test vali¬ 
dating summary statistics and to the realisation that 
a larger number of summaries helps in achieving this 
goal (while degrading the estimated tolerance). More 
importantly, it shows that the reduction of information 
represented by an ABC approach may prevent discrimi¬ 
nating between models, at least when trying to recover 
the Bayes factor. In the end, this is a natural conse¬ 
quence of simplifying the description of both the data 
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and the model, and can be found in most limited infor¬ 
mation settings. 


3.2 More fish in the alphabet soup 


Besides ABC, approximation techniques have spread 
wide and far towards analysing more complex or less 
completely defined models. Rather than a confusion, 
this multiplicity of available approximations is benefi¬ 
cial both to the understanding of the underlying model 
and to the calibration of those different methods. 

Variational Bayes methods have been proposed for 
at least two decades to substitute exponential families 


g(6*|A) for complex posterior distributions w(6) (Jordan 


et al.|[l999 MacKay||2002 ). The central notion in those 
methods is that the exponential family structure and 
a so-called mean-field representation of the approxima¬ 
tion 

k 

q(e |A) = n<&(W 

i=1 

allows for a sometimes closed-form minimisation of the 
Kullback-Leibler distance KL(g(0| A), 7r(0)) between the 
true target and its approximation. If not, the setting is 


quite congenial to the use of EM algorithms (Paisley 


et al. 2012). See Salimans and Knowles (2013) for a 


contemporary view on this approach, which offers con¬ 
siderable gains in terms of computing time, while be¬ 
ing difficult to assess in terms of discrepancy with the 
“truth”, i.e., the outcome that would result from using 
the genuine posterior. 

Another approach that has met with considerable 
interest in the past five years is Integrated nested Laplace 


approximation (INLA) (Rue et al. 2009). The method 
operates on latent Gaussian random fields, with likeli¬ 
hoods of the form 




i=1 

where the x^s are the observables and the rji s are la¬ 
tent variables. Using Laplace approximations to the 
marginal distributions 7r(f?|x 0 ) and to /(7j|xo), INLA 
produces fast and accurate approximations of the true 
posterior distribution as well as of the marginal likeli¬ 
hood value. Thanks to the availability of a well-constructed 
package called R-INLA, this approach has gathered a 
large group of followers. 

A somewhat exotic example of variational approxi¬ 
mation is expectation-propagation (EP) ( Minkap OOl ), 
which starts from an arbitrary decomposition of the 
target distribution 


= YlvM 


j =i 


(often inspired by a likelihood decomposition into groups 
of observations) and iteratively approximate each term 
7 Tj in the product by a density member of an exponen¬ 
tial family, z/(-|A)m using the other approximations as 
a marginal. Given the current approximation of tt(9) at 
iteration t, 

k 

3 =1 

where At is the current value of the hyperparameter, 
the i-th step in the expectation-propagation (EP) algo¬ 
rithm goes as follows: 


1. Select 1 < j < k at random 

2. Define the marginal 


V-j(9\X t ) oc 


v{0\X t ) 

"M At) ! 


3. Update the hyperparameter X t by solving 


At+i = argminKL {'Kj(6)v_j{9\\ t ), u(9\X)} 

A 

4. Update vj{9 |A t ) as 


Vj{9 |A t+ i) oc 


Vk) 
y-M ^)' 


(In the above, KL denotes the Kullback-Leibler diver¬ 
gence.) The algorithm stops at stationarity. The con¬ 
vergence of this approach is not yet fully understood, 
but Barthelme and Chopin (2014) consider expectation- 
propagation as a practical substitute for ABC, avoiding 
the selection of summary statistics by using a local con¬ 
straint 


\Xi — x 


,obs | 


< e 


on each element of the simulated pseudo-data vector, 
a: obs being the actual data. In addition, expectation- 
propagation provides an approximation of the evidence. 
In the ABC setting, when using a Normal distribu¬ 
tion as the exponential family default, implementing 
EP means computing empirical mean and empirical 
variance, one observation at a time, under the above 
tolerance constraint. Obviously, using a Normal candi¬ 
date means that the final approximation will also look 
much like a Normal distribution, which both links with 
other Normal approximations like INLA and variational 
methods, and signals a difficulty with EP in less smooth 
cases, such as ridge-like or multimodal posteriors. 

While different approximations keep being devel¬ 
oped and tested, with arguments ranging from efficient 
programming, to avoiding simulations, to having an 
ability to deal with more complex structures, their draw¬ 
back is the overall incapacity to assess the amount of 
approximation involved. Bootstrap evaluations can be 
attempted in the simplest cases but cannot be extended 
to more realistic situations. 
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4 Optimisation in modern Bayesian 
computation 


Optimisation methodology for high-dimensional maxi- 
mum-a-posteriori (MAP) estimation is another area of 
Bayesian computation that has received a lot of atten¬ 
tion over the last years, particularly for problems re¬ 
lated to machine learning, signal processing and com¬ 
puter vision. One reason for this is that for many Bayesian 
models optimisation is significantly more computation¬ 
ally tractable than integration. This has generated a 
lot of interest in MAP estimators, especially for ap¬ 
plications involving very high-dimensional parameter 
spaces or tight computing time constraints, for which 
calculating other summaries of the posterior distribu¬ 
tion is not feasible. Here we review some of the major 
breakthroughs in this topic, which originated mainly 
outside the statistics community. We focus on devel¬ 
opments related to high-dimensional convex optimisa¬ 
tion, though many of the techniques discussed below are 
also useful for non-convex optimisation. In particular, 
in Section [44] we concentrate on proximal optimisation 
algorithms , a powerful class of iterative methods that 
exploit tools from convex analysis, monotone operator 
theory and theory of non-expansive mappings to con¬ 
struct carefully designed fixed-point schemes. We refer 
the reader to the excellent book by |Bausc hke and Conr] 
bettes (2011) for the mathematics underpinning proxi¬ 


mal optimisation algorithms, and to the recent tutorial 


papers by Combettes and Pesquet (2011), Cevher et al. 


(2014) and Parikh and Boyd (2014) for an overview of 


the field and applications to signal processing and ma¬ 
chine learning. 

However, we do think it is vital to insist that, at 
the same time as asserting that modern optimisation 
methodology represents a much-underused opportunity 
in Bayesian inference, in its raw form it inevitably fails 
to deliver essential elements of the Bayesian paradigm. 
The vision is not to deliver a point estimate of an un¬ 
known structure, but the full richness of Bayesian in¬ 
ference in its coherence, its proper treatment of uncer¬ 
tainty, its intrinsic treatment of model uncertainty, and 
so on. Bayesian statistics does not boil down to optimi¬ 


sation with penalisation (Lange et al. 2014). We need 


to express the uncertainty associated with decisions and 
estimation, stemming from the stochastic nature of the 
data, and our lack of knowledge about relevant mecha¬ 


nisms. 


The challenge is to use the awesome capacity of fast 
optimisation in a high-dimensional parameter space to 
focus on local regions of that space where a combina¬ 
tion of analytic and numerical investigation can deliver 
at least approximations to full posterior distributions 


and derived quantities. The community has barely risen 
to this challenge, with only isolated examples such as 
the discussion in Green (2015) of a problem in unla¬ 
belled shape analysis. However, the growing community 
of INLA (Rue et al. 2009) users may bring an height¬ 
ened awareness of such possibilities, along with its effi¬ 
cient code (Schrodle and Held 2011 Muff et ah|2013 ). 
Another promising research area is to use mathematical 
and algorithmic tools from convex optimisation to de¬ 
sign more efficient high-dimensional MCMC algorithms 
( Pereyra||2015 1. 


4.1 Proximal algorithms 


Similarly to many other computational methodologies 
that are widely used nowadays, proximal algorithms 


were first proposed several decades ago by Moreau (1962), 


Martinet (1970) and Rockafellar (19761, and regained 


attention recently in the context of large-scale inverse 
problems and “big data”. 

We consider the computation of maximisers of pos¬ 
terior densities n(9) = exp {—g(9)}/n that are high- 
dimensional and log-concave, which we formulate as 


9 map = argmin g(9) 
6eR n 


( 8 ) 


where g belongs to the class /b(K n ) of lower semicon- 
tinuous convex functions from R” —> (—oo,+oo]. No¬ 
tice that g may be non-differentiable and take value 
g(9) = +oo, reflecting constraints in the parameter 
space. In order to introduce proximal algorithms we 
first recall the following standard definitions and results 
from convex analysis: We say that ip £ R" is a subgra¬ 
dient of g at 9 £ R” if it satisfies (u — 0) T ip + g{9) < 
g(u),Vu £ R". The set of all such subgradients defines 
the subdifferential set dg(9), and 9 map is a nrinimiser 
of g if and only if 0 G dgi&MAP)■ The (convex) conju¬ 
gate of g £ /o(R n ) is the function g* £ /o(R n ) defined 
as g*(p>) = sup ueR „ u 7 \p — g( u). The subgradients of g 
and g* satisfy the property p> £ dg(9) dg*{p>). 

Proximal algorithms take their name from the prox¬ 
imity mapping, defined for g £ /o(R") and A > 0 as 
(Moreau 1962) 

proXg(0) = argmin g(u) + ||u — 0|| 2 /2A. (9) 

uei" 


In order to gain intuition about this mapping it is use¬ 
ful to analyse its behaviour when A £ R + is either very 
small or very large. In the limit A —> oo, the quadratic 
penalty term vanishes and Q maps all points to 9 map- 
In the opposite limit A —> 0, © becomes the iden¬ 
tity operator and maps 9 to itself. For finite values of 
A, proXg(<?) behaves similarly to a gradient mapping 
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and moves points in the direction of 9 map- Like gradi¬ 
ents, proximity mappings have several properties that 


are useful for devising fixed-point methods (Bauschke 
and Co mbettes| |2011). 


Property 1: The proximity mapping of g is related to 
its subdifferential by the inclusion {9 — proXg(0)}/A € 
<9g{proXg(0)}, which collapses to V.g{proXg(0)} when 
g G C 1 . As a result, for any A > 0, the minimiser of g 
verifies the fixed-point equation 9 = pro Xg(9). 

Property 2: Proximity mappings are firmly non-expansive 
that is, || proXg(0) —proXg(u)|| 2 * < (9 — u) 7 {proXg(0) — 
prox^(u)}, V0, u G R ra . 

Property 3: The proximity mappings of g and its con¬ 
jugate g* are related by Moreau’s decomposition for¬ 
mula: 9 = proXg(0) + AproXg{ A (0/A). 

The simplest proximal method to solve Q is the 
proximal point algorithm given by the iteration 


9 k 


= pro x x J9 k ). 


( 10 ) 


Every sequence {9 k } ke j$ produced by this algorithm 
converges to 9 map, even if proximity mappings are 
evaluated inexactly, as long as the errors are of cer¬ 
tain types (e.g., summable). A more general proximal 
point algorithm includes relaxation, i.e., 


9 k 


= (1 - a k )9 k +a k proxg(0 fc ), a k £ (0,2), 


and with over-relaxation (i.e., a k G (1,2)) often con¬ 


verges faster than (101. Notice from Property 1 that 


(10) can be interpreted as an implicit (backward) sub¬ 
gradient steepest descent to minimise g , i.e., 9 k+1 = 
9 k — A ip, with ip G dg ^0 fc+1 ^. Alternatively, proximal 
point algorithms can also be interpreted as explicit (for¬ 
ward) gradient steepest descent to minimise the Moreau 
envelope of g, e\{9) = inf ue R»> g( u) + ||u — 0|| 2 /2A, a 
convex lower bound on g that by construction is con¬ 
tinuously differentiable and has the same minimiser as 
9- 

Proximal point algorithms may appear of little rel¬ 
evance because evaluating prox^ can be as difficult as 
solving ([8]) in the first place (notice that ®> is a con- 
vex minimisation problem similar to Q). Surprisingly, 
many advanced proximal optimisation methods can in 
fact be shown to be either applications of this simple 
algorithm, or closely related to it. 

Most proximal methods operate by splitting g , e.g., 


9 map = argmin{gi(0) + g 2 {0)}, 

o GR" 


( 11 ) 


such that g\ G To(M") and g 2 G rb(R”) have gra¬ 
dients or proximity mappings that are easy to com¬ 
pute or approximate. For example, for many Bayesian 


models it is possible to find a decomposition g{9) = 
gi{9) + g 2 (9) such that g\ is /TLipschit^] differentiable 
and g 2 G Ib(M n ), possibly non-differentiable, has a 
proximity mapping that can be computed efficiently 
with a specialised algorithm. This decomposition is use¬ 
ful for instance in linear inverse problems, where g\ is 
often related to a Gaussian observation model involv¬ 
ing linear operators and g 2 to a log-prior promoting a 
parsimonious representation (e.g., sparsity on some ap¬ 
propriate dictionary, low-rankness) or enforcing convex 
constraints (e.g., positivity, positive definiteness). For 
models that admit this decomposition, it is possible to 
compute 9map efficiently with a forward-backward al¬ 
gorithm, also known as the proximal gradient algorithm 


9 


k+1 


= prox^"(0 fc - A„V<7i(0 fc )). 


( 12 ) 


For \ n = X G (0,1//3) the objective function g(9 k ) 
converges to g(9 M Ap) with rate 0(l/k). If the value of 
the Lipschitz constant (3 is unknown X n can be found 
by line-search. 

A remarkable property of ( |T2| is that it can be accel¬ 
erated to converge with rate 0(l/k 2 ), which is optimal 
for this class of problems ( |Nesterov]|2004 ). This can be 
achieved for instance by introducing an extrapolation 
step 


9+ = 9 k +uj k (9 k -9 k ~ k ), 

9 k+1 = prox^" 1 (0+ - p~ 1 Vg 1 (9 + )), 


(13) 


where {ui k } k& ^ is an appropriate sequence of extrap¬ 
olation parameters. It was noticed by |Combettes and| 
Pesquet (2011) that several important convex optimi¬ 


sation algorithms can be derived as applications of the 
forward-backward algorithm, for example the projected 
gradient algorithm for minimising a Lipschitz differen¬ 
tiable function subject to a convex constraint (in this 
case the proximity mapping reduces to a projection 


onto the convex set). Notice that (12) can be inter¬ 


preted as an implementation of the proximal point iter¬ 
ation (10) where pro yfl{9 k ) is approximated by replac¬ 


ing gi with its first order Taylor series approximation 
around the point 9 k . 

Moreover, in some cases it may be more efficient 


to compute 9map by solving the dual of (11), for in¬ 


stance if g admits a decomposition g{9) = g±(9) + 
g 2 (L9) for some linear operator L G R" xp , gi G /b(R”) 
strongly convex and g 2 G I}i(R p ) with efficient proxim¬ 
ity mapping. In this case, the Fenchel-Rockafellar the¬ 
orem states that 9 map can be computed by solving the 

2 gi E C 1 has /3-Lipschitz continuous gradient if ||V<?i(0) — 

V< 7 i(u)|| < /3\\6 — u||, V(0, u) E S. N x R N 
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dual problem (Bauschke and Com bettesp Ol 1 ch. 19) 

= argrnin g* {—L T if) + g^(if) (14) 

i/>eRp 

and setting 9 map = V< 7 i(— L T if*). This p-dimensional 
problem can be solved iteratively with a forward-backward 
algorithm xf k+1 = prox^f 1 (xf k — A n 'Vg*(—L T tf k )) that 
can also be accelerated to converge with rate 0(l/k 2 ), 
and where we note that the proximity mapping of g 2 is 
typically evaluated by using Property 3, and that the 
strong convexity of g\ implies Lipschitz differentiability 


of gl. Computing 9 map via (14) can lead to important 


computational savings, in particular if p -C n or if g 2 
is separable and has a proximity mapping that can be 
computed in parallel for each element of 0 (this is gen¬ 
erally not possible for g 2 o L). We refer the reader to 


Komodakis and Pesquet (2014) for an overview of re¬ 


cent dual and primal-dual algorithms and guidelines for 
parallel implementations. 

Another important proximal optimisation method 
is the Douglas-Rachford splitting algorithm given by 


G k+ ? = prox* i (0 fe ), 


e k+ 1 = e k - + proxi, (20 fe+ 3 - e k ). 


(15) 


From a theoretical viewpoint this algorithm is more 
general than the forward-backward algorithm because 
it does not require g\ or g 2 to be continuously differen¬ 
tiable. However, its practical application is limited to 
problems for which both g\ and g 2 have efficient prox¬ 
imity mappings. Similarly to the forward-backward al¬ 


gorithm, (15) includes many proximal algorithms that 


been proposed in the literature for specific models, and 
can also be interpreted as an application of the proxi¬ 
mal point algorithm. 

The proximal method that is arguably most widely 
used in Bayesian inference is the alternating direction 
method of multipliers (ADMM), which operates by for¬ 
mulating © as a constrained optimisation problem 


argmin g 1 (9) + g 2 (z) 

06*™, zgK 71 

subject to 9 = z, 


(16) 


and then using augmented Langrangian techniques to 
express (16) as an unconstrained saddle point problem 
with saddle function g\{9) + g 2 (z) + Xp T (9 — z) + \\9 — 
z|| 2 /2A ( Boyd et al.|2011 ). ADMM solves this problem 
with the iteration 


9 k+1 = prox* (z fc - V fc ), 

z fc+1 = proXg 2 (0 fc+1 + ip k ), (17) 

p k+1 = tp k + e k+1 -z k+1 , 


that also involves the proximity mappings of g± and g 2 . 
This basic ADMM iteration can be tailored to specific 
models in many ways (e.g., to exploit decompositions of 
the form < 7 i = g\ o Li and g 2 = g 2 o L 2 so that proximal 
updates can be performed in parallel for all components 
of 9 , z and ip). Interestingly, ADMM can be interpreted 
as an application of the Douglas-Rachford algorithm to 


the dual of (16), and is therefore also a special case of 


the proximal point algorithm. For more details about 
the ADMM algorithm, see the recent tutorial by |Boyd[ 
etaLlpoTTl). 


Furthermore, an important characteristic of proxi¬ 
mal optimisation algorithms is that they can be mas¬ 
sively parallelised to take advantage of parallel com¬ 
puter architectures. Suppose for instance that g ad¬ 
mits the decomposition g(9) = Xim=i 9m(L m 9) with 
g m £ r(R Pm ) and L m £ R” X;Pm such that the mappings 
of g m are easy to compute and Q = J2m=i is in¬ 

vertible. Then, in a manner akin to (16), we express ([8| 
as 




El VI 

gm{ z m) 

m—1 

subject to z m = L m 0, Vra = 1,..., M, 


argmin 

ziGK 71 ,..., z M e® 


(18) 


and compute 9map with the following iteration 

0 fc+1 = Q" 1 £ M Ll(z k m -p k J, 

z m +1 = P r0 x g m (L m 9 k+1 - p k m ), Vm = 1,..., M, ( 19 ) 

rtt 1 =vt + Lm9 k+1 -Z k m +1 , Vm = 1,..., M, 

that can be parallelised with factor M at a coarse level 
(e.g., on a multi-processor system). Further parallelisa¬ 
tion may be possible at a finer scale (e.g., on a vectorial 
processor such as GPU or FPGA) by taking advantage 
of the structure of pi'OXg m or by using specialised al¬ 
gorithms. This algorithm, known as the simultaneous 
direction method of multipliers , is also closely related 
to the ADMM, Douglas-Rachford and proximal point 
algorithms. Notice that splitting g not only allows the 
exploitation of parallel computer architectures, but may 
also significantly simplify the computation of proximity 
mappings; often prox^ m has a closed-form expression. 
Lastly, it is worth mentioning that there are other mod¬ 
ern proximal optimisation algorithms that can be mas¬ 
sively parallelised, for example the generalised forward 


backward algorithm ( 

Raguet et al. 

2013 

), the paral- 

lei proximal algorithms ( 

Combettes and Pesquet 2008 

Pesquet and Pustelnik 2012), and the parallel primal- 

dual algorithm (Combettes and Pesquet 2012 

)■ 


Finally, main current topics of research in proximal 
optimisation include theory and methodology for: 1) 
randomised and stochastic algorithms that operate with 
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estimators of gradients and proximity mappings to re¬ 
duce computational complexity and allow for errors in 
the update rules, 2 ) adaptive and variable metric algo¬ 
rithms (e.g. Riemannian and Newton-type) that exploit 
the model’s geometry to improve convergence speed, 
and 3) proximal methods for non-convex problems. We 
anticipate that in the future new and stronger connec¬ 
tions will emerge between proximal optimisation and 
stochastic simulation, in particular through develop¬ 
ments in stochastic optimisation and high-dimensional 
MCMC sampling. For example, one connection is through 
the integration of modern stochastic convex optimisa¬ 


tion and Markovian stochastic approximation (Com- 


bettes and Pesquet 

2014 

Andrieu et al. 

2015) 


of proximal optimisation and high-dimensional MCMC 
sampling (Pereyra]2015). 


4.2 Convex relaxations 


Modern proximal optimisation was greatly motivated 
by important theoretical results on the recovery of partially- 
observed sparse vectors and low-rank matrices through 


Tao 


convex minimisation (Candes et al. 2006 Candes and 


2009) and on compressive sensing (Candes and 


Wakin|2008 ). A key idea underlying these works is that 


of approximating a combinatorial optimisation prob¬ 
lem, whose solution is NP-hard, with a “relaxed” con¬ 
vex problem that is computationally tractable, and whose 
solution is in some sense close to the solution of the orig¬ 
inal problem. Reciprocally, the development of mod¬ 
ern convex optimisation has in turn generated much 
interest in log-concave models, convex regularisers, and 
“convexifications” (i.e., convex relaxations for intractable 
or poorly tractable models) for statistical inference prob¬ 
lems involving high-dimensionality, large datasets and 


computing time constraints (Chandrasekaran et al. 2012 


Chandrasekaran and Jordan|2013). 


hierarchical Bayesian model (Oliveira et al.|[2009| 


/(y|0) = (27Rj 2 )" n/2 exp{—||y - H0\\l/2cr 2 }, 
n(6\a) oc a~ n exp (-a|| V d 0||i_ 2 ), ( 20 ) 

7r(a) = e -Q l R+ (a), 


where 7r(0|a) is the (improper) total-variation Markov 
random field, || • ||i _2 denotes the composite ^ 1—^2 norm 
and Vd is the discrete gradient operator that computes 
the vertical and horizontal differences between neigh¬ 
bour image pixels. This prior is log-concave and mod¬ 
els the fact that differences between neighbouring im¬ 
age pixels are usually very small but occasionally take 
large values; it is arguably the most widely used prior 
in modern statistical image processing. The values of 
H and a 2 are typically determined during the system’s 
calibration process and are here assumed known. 

We compute the MAP estimator of 6 associated 
with the marginal posterior 7r(0|y) = / 0 °° 7r(0, a|y)da, 
which is unimodal but not log-concave, 


9 map = argrnin 

0SR" 


\\y ~ H0\\ 2 /2a 2 
+ (n + 1) log (||V d 0||i _ 2 + 1) • 


( 21 ) 


Problem (21) is not convex, but can nevertheless be 
solved efficiently with proximal algorithms by using a 
majorisation-minimisation strategy. To be precise, start¬ 
ing from some initial condition 9 l -°\ e.g., 9 ^ = y, we 
iteratively minimise the following sequence of strictly 
convex majorants (Oliveira et al.|2009) 


0 (t+1) = argrnin ||y - H0\\l/2cr 2 + o^||V d 0||i_ 2 , 

9eR n 

with = (n+ l)(||V d 0 (t) ||i _ 2 + 1). 

( 22 ) 


4.3 Illustrative example 

For illustration, we show an application of proximal op¬ 
timisation to Bayesian image resolution enhancement. 
The goal is to recover a high-resolution image 9 £ 
from a blurred and noisy observed image y ~ M{H9 , 
a 2 In), where H £ K raxn is a linear operator represent¬ 
ing the blur point spread function of the low resolu¬ 
tion acquisition system and cr 2 is the system’s noise 
power. This inverse problem is ill-posed, a difficulty that 
Bayesian image processing methods address by exploit¬ 
ing prior knowledge about 9. Here we use the following 


Iteration (22) involves a convex subproblem that can 


easily be solved using most modern proximal optimisa¬ 
tion techniques. For example, here we use the state-of- 
the-art ADMM algorithm SALSA ( Afonso et al.pOlT ) 


implemented with gi(9) = ||y — H9\\ 2 /2a 2 1 < 72 ( 11 ) = 
&eff\\ V(ju||i_ 2 , and the constraint 0 = u [though we 
could have also used other modern algorithms (Pes- 


quet and Pustelnik| 2012; [Combettes and Pesquet|2012[ 


Raguet et al. 2013)]. To compute the proximity map¬ 
ping of gi we use the fact that H is block-circulant to 
compute matrix products and pseudo-inverses with the 
FFT algorithm. We compute the proximity mapping 
of 32 with a highly parallelised implementation of the 


specialised algorithm of Chambolle (2004). 
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Fig. 4 Resolution enhanced image 6 map obtained b y so lv- 
ing ( |21| ) with the majorisation-minimisation strategy \22\ . 
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Fig. 5 Widths of pixel-wise 90% marginal credibility in¬ 
tervals estimated with the proximal MCMC algorithm of 
Pereyraj (2 015| ). 




Fig. 6 Convergence of the estimate 6 to Gmap vs comput¬ 
ing time (seconds). 


Figure [3] presents a blurred and noisy observation y 
of the popular “boats” image of size 512 x 512 pixels, 
generated with a uniform 9x9 blur and a noise power 
of er 2 = 0.5 2 (blurred-signal-to-noise ratio BRSN = 
10 log lo {||iy0o||l/ cr2 } = 40dB). Figure [i] below shows 
the MAP estimate 6 map obtained by solving (211 using 
4 iterations of (22) and a total of 51 ADMM iterations. 
We observe that this resolution enhancement process 
has produced a remarkably sharp image with very no¬ 
ticeable fine detail. Moreover, Figure [5] shows the mag¬ 
nitude of the marginal 90% credibility regions for each 
pixel, as measured by the distance between the 5% and 
95% quantile estimates. These estimates were computed 
using the proximal Metropolis-adjusted Langevin algo¬ 
rithm (Pereyra 2010, which is appropriate for high¬ 
dimensional densities that are not continuously differ¬ 
entiable. We observe in Figure [0 that the uncertainty is 
mainly concentrated at the contours and object bound¬ 
aries, revealing that model is able to accurately detect 
the presence of sharp edges in the image but with some 
uncertainty about their exact location. Finally, Figure 
[H] shows the convergence of the estimates pro¬ 

duced by each AD AIM iteration to 9 map (as measured 
by the mean squared error — OmapM) ■ Notice 

that computing Omap only required 10 seconds (exper¬ 
iment conducted on an Apple Macbook Pro computer 
running Matlab 2013, a C++ implementation would 
certainly produce even faster results). This is remark¬ 
ably fast given the high dimensionality of the problem 
(n = 262 144). The computation of the credibility re¬ 
gions by MCMC sampling (20 000 samples with a thin¬ 
ning factor of 1 000 to reduce the algorithm’s memory 
foot-print) required 75 hours. 
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5 Discussion 


5.1 Bayesian computation in the era of data science 


Is there a revolution taking place right now and have we 
missed the train, standing on the platform, only con¬ 
cerned with small-print on the train schedules - apart, 
that is, from the obvious but not-so-new requirement to 
handle massive datasets (and the mistakes that come 
with them)?! 

As with other areas of statistical science, the Bayesian 
computation community has to decide whether data sci¬ 
ence is an opportunity or a threat. Inevitably if we do 
not treat it as an opportunity, it will become a threat. 
Thanks to the ubiquity of “big data” (as an over-hyped 
phrase mostly useful for attracting research funding, 
but also to at least some extent in reality), a new poten¬ 
tially multi-disciplinary field of data science is rapidly 
opening up. This field is attracting huge material re¬ 
sources, and will absorb much human talent. Statisti¬ 
cal science has to be a part of this, for its own survival, 
but also for the sake of society. As Tim Harford has 
cogently argued (Harfordjj2014|: 


Recall big data’s four articles of faith. Un¬ 
canny accuracy is easy to overrate if we simply 
ignore false positives [...]. The claim that cau¬ 
sation has been “knocked off its pedestal” is fine 
if we are making predictions in a stable environ¬ 
ment but not if the world is changing [... ] or if 
we ourselves hope to change it. The promise that 
“N = All”, and therefore that sampling bias does 
not matter, is simply not true in most cases that 
count. As for the idea that “with enough data, 
the numbers speak for themselves” - that seems 
hopelessly naive in data sets where spurious pat¬ 
terns vastly outnumber genuine discoveries. 

“Big data” has arrived, but big insights have 
not. The challenge now is to solve new prob¬ 
lems and gain new answers - without making 
the same old statistical mistakes on a grander 
scale than ever. 


It is a mistake to think that Bayes has no part to 
play in these developments, but more of us need to get 
more involved, and learn new tools, as in the way the 


Consensus Monte Carlo algorithm 

(Scott et al. 

2013) 

exploits the Hadoop environment (’ 

White|2012 

) and the 

MapReduce programming model (I 

lean and Ghemawat 


20081. Another direction that can prevent a potential 


schism between Bayesian modelling and highly complex 
models is to aim for modularity and local learning, that 
it, to abandon the goal of modelling big universes for 
analysing a series of small worlds, in spite of the loss 


of coherence, amd hence compromise to the Bayesian 
paradigm, that this entails. The curious case of the 
cut models presented in Plummer (20141 is an illustra¬ 
tion of the potential for developing partial-information 
Bayesian inference tools where “small is beautiful” be¬ 
cause this is the only viable solution. 


5.2 Do we care enough about applications? 

Bayesian computation began in order to answer rather 
practical problems - how can we perform a Bayesian 
analysis of these data using this model? - or the cor¬ 
responding meta-problems - how can Bayesian analy¬ 
sis be performed generally and reliably for this class 
of models? The focus was applied methodology (al¬ 
though since the methods were new, they tended to 
be published in premier theory/methodology journals). 
Because the research community wanted to understand 
(the advantages, performance and limitations of) the 
methods they were advocating, more theoretical work 
started to be conducted, and, for example, many prob- 
abilists were attracted to study the Markov chains that 
MCMC methodologists created. The centre of mass of 
research activity drifted away from the original motiva¬ 
tions, just as has happened in other areas of mathema- 
tically-rigorous computation. 

At the same time, those working with data became 
more ambitious with regard to the scale of data, the 
complexity of modelling and the sophistication of anal¬ 
ysis, all factors that have in principle (and often in fact) 
stimulated new developments in Bayesian computation. 
But to a large extent this is a rich, self-stimulating and 
self-supporting area of research; new applications may 
or may not need new computational techniques, but 
new techniques don’t seem to need applications to jus¬ 
tify themselves. It is apposite to ask to what extent is 
cutting-edge computational methodology research re¬ 
ally delivering answers to questions that application 
domains are posing. And to what extent is cutting- 
edge computational methodology research successfully 
answering real questions? 

We may not be unanimous about answers to these 
questions, except we can probably all agree they are 
“not entirely”. We will also disagree about how much 
this matters, but again there may be something to agree 
about, that we have failed if methodological innova¬ 
tions disconnect completely from applications. Legiti¬ 
mate differences in research goals partially explain the 
trend in this direction, but it is fair to say that there 
is a big communication problem between the compu¬ 
tational statistics community and many of the com¬ 
munities where Bayesian computational methods are 
applied. Unfortunately people in these communities do 
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not always keep up with the state of the art in compu¬ 
tational statistics. At the same time, statisticians are 
often not aware of important developments arising in 
other fields. (ABC is a good illustration: it took more 
than five years of development within the population 
genetics community before statisticians became aware 
the technique existed and a few more years before they 
realised this was proper Bayesian inference applied on 
approximate models.) We can perhaps blame the fact 
that there are not enough people working at the inter¬ 
face of the different communities, but life at the inter¬ 
face is not easy because multidisciplinary and interdis¬ 
ciplinary research is often seen as “marginal” by both 
communities and is thus difficult to publish, communi¬ 
cate, etc. Then there are of course problems in dissem¬ 
ination, related to the different writing styles, journals, 
computing languages, software, etc. of each community. 

We strongly encourage those developing new tech¬ 
niques always to find a way to disseminate them in 
such a way that at least somebody else could use them, 
preferably someone without the ability to have invented 
the technique for themselves! - and advocate, of course, 
that successful dissemination be properly rewarded in 
our career structures. 


In a somewhat parallel path, we have seen over the 
past decades the emergence of new languages and meta¬ 
languages intended to handle complexity both of prob¬ 
lems and of solutions towards a wider audience of users. 
BUGS (Lunn et al. 2010) is the archetypal example of 
such languages and it has been successful to the ex¬ 
tent that a large proportion of the users has a fairly 
limited statistical background and often even less of 
a computational background. However, the population 
of BUGS users and sympathisers is tiny compared to 
that of SAS or other corporate statistical systems. In 
this respect, we have failed to disseminate concepts like 
Bayesian analysis and wonderful tools like MCMC al¬ 
gorithms, because most people are unable to turn them 
into codes by themselves. (Perusing one of the numer¬ 
ous statistics and machine-learning on-line forums like 
Cross Validated quickly exposes the methodological gap 
between academics and the masses!) It is unclear how 


novel programming developments like STAN (Stan De¬ 


velopment Team 2014) are going to modify this pic¬ 


ture, in that they still assume a decent understanding 
of both modelling and simulation issues. In that re¬ 
spect, network-based approaches as those covered by 
BUGS sound more promising towards “modelling lo¬ 
cally to learn globally”. Similarly, ABC software is ei¬ 
ther too specific, like DIYABC (Cornuet et al. 20081 
which addresses only population genetic questions, or 
too dependent on the ability of the modeller to program 
simulated outcomes from the model under study. 


5.3 Anticipating the future 


In which of the areas we discuss do we expect a partic¬ 
ular emphasis of effort, or significant progress, or do we 
see particular needs for new efforts or new directions? 

One expectation is that in the future computational 
methodologies will be more flexible and malleable. Over 
the past 25 years Bayesian modelling and inference tech¬ 
niques have been applied successfully to thousands of 
problems across a wide range application domains. Each 
application brings its own constraints in terms of model 
dimensionality and complexity, data, inferences, accu¬ 
racy and computing times. These constraints also vary 
significantly within specific applications. For example, 
in hyperspectral remote sensing, when a new Bayesian 
model is introduced it is often first explored and val¬ 
idated by MCMC sampling, then approximated with 
a variational Bayes method, and then approximated 
again so that it can be applied to gigabyte-large datasets 
by using optimisation techniques. Similarly, an interest¬ 
ing result revealed by a fast inference technique can be 
analysed more deeply with more reliable and accurate 
methods. Therefore we expect that in the future the dif¬ 
ferent main computational methodologies will become 
more adaptable and that the boundaries between them 
will be less well defined, with many algorithms devel¬ 
oped that combine simulation, variational approxima¬ 
tions and optimisation. These will be able to handle a 
wide spectrum of models, degrees of accuracy and com¬ 
puting times, as well as models that have some parts 
that are simple but high-dimensional and others that 
are more complex but that only involve low-dimensional 
components. This can be achieved by using approxima¬ 
tions and optimisation to improve stochastic sampling, 
by using simulation within deterministic algorithms to 
handle specific parts of the model that are difficult to 
compute analytically, or in completely new and original 
ways. 

We also anticipate that computational methodolo¬ 
gies will continue to be challenged by larger and larger 
datasets. There is of course a threat that the whole 
field turns into a library of machine-learning techniques, 
with limited validation on reference learning sets and a 
quick turnover of methods, which would both impov¬ 
erish the field and fail to reach a general audience of 
practitioners. We must retain a sense of the stochastic 
elements in data collection, data analysis, and inference, 
recognising uncertainty in data and models, to preserve 
the inductive strength of data science - seeing beyond 
the data we have to what it might have been, what it 
might be next time, and where it came from. 
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